<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/stylesheet.xsl" type="text/xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <atom:link rel="self" type="application/rss+xml" href="https://feeds.transistor.fm/daily-paper-cast-ai" title="MP3 Audio"/>
    <atom:link rel="hub" href="https://pubsubhubbub.appspot.com/"/>
    <podcast:podping usesPodping="true"/>
    <title>Daily Paper Cast</title>
    <generator>Transistor (https://transistor.fm)</generator>
    <itunes:new-feed-url>https://feeds.transistor.fm/daily-paper-cast-ai</itunes:new-feed-url>
    <description>We update every weekday to discuss the highest-voted papers from Huggingface Daily Paper (https://huggingface.co/papers). Both the podcast scripts and audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, LLM ML, http://wanggengyu.com

Listen on: 
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art</description>
    <copyright>© 2026 Jingwen Liang, Gengyu Wang</copyright>
    <podcast:guid>b6f03825-a76d-5e13-a5e0-c4ad33d3746b</podcast:guid>
    <podcast:locked owner="dailypapercast.ai@gmail.com">no</podcast:locked>
    <language>en</language>
    <pubDate>Wed, 22 Apr 2026 21:25:18 -0700</pubDate>
    <lastBuildDate>Wed, 22 Apr 2026 21:26:05 -0700</lastBuildDate>
    <link>https://dailypapercast.transistor.fm/</link>
    <image>
      <url>https://img.transistorcdn.com/IxaBeiMluxrMS9W9wB8hFMfmvH27KvwaSMzuhucupn0/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS81Zjg1/YzRhODczMDU4MmE4/OGMwN2FiNDlmYzI2/MDliMi5qcGVn.jpg</url>
      <title>Daily Paper Cast</title>
      <link>https://dailypapercast.transistor.fm/</link>
    </image>
    <itunes:category text="Science"/>
    <itunes:category text="Technology"/>
    <itunes:type>episodic</itunes:type>
    <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
    <itunes:image href="https://img.transistorcdn.com/IxaBeiMluxrMS9W9wB8hFMfmvH27KvwaSMzuhucupn0/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS81Zjg1/YzRhODczMDU4MmE4/OGMwN2FiNDlmYzI2/MDliMi5qcGVn.jpg"/>
    <itunes:summary>We update every weekday to discuss the highest-voted papers from Huggingface Daily Paper (https://huggingface.co/papers). Both the podcast scripts and audio are generated by AI. Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, LLM ML, http://wanggengyu.com

Listen on: 
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover Image by Kawen Kuang https://kawen.art</itunes:summary>
    <itunes:subtitle>We update every weekday to discuss the highest-voted papers from Huggingface Daily Paper (https://huggingface.co/papers).</itunes:subtitle>
    <itunes:keywords></itunes:keywords>
    <itunes:owner>
      <itunes:name>Jingwen Liang, Gengyu Wang</itunes:name>
    </itunes:owner>
    <itunes:complete>No</itunes:complete>
    <itunes:explicit>No</itunes:explicit>
    <item>
      <title>Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items</title>
      <itunes:episode>1795</itunes:episode>
      <podcast:episode>1795</podcast:episode>
      <itunes:title>Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">04f71c62-bac7-4006-9b88-55f1c422ef92</guid>
      <link>https://share.transistor.fm/s/c2df3c37</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19748v2">http://arxiv.org/abs/2604.19748v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19748v2">http://arxiv.org/abs/2604.19748v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Apr 2026 21:25:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c2df3c37/24df74ec.mp3" length="23052997" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1437</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19748v2">http://arxiv.org/abs/2604.19748v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</title>
      <itunes:episode>1794</itunes:episode>
      <podcast:episode>1794</podcast:episode>
      <itunes:title>CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ad5eafb7-1e94-483c-bfda-7e1f1c05b9fc</guid>
      <link>https://share.transistor.fm/s/90a873da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma</p>

            <p><strong>Title:</strong><br>
            CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19636v1">http://arxiv.org/abs/2604.19636v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma</p>

            <p><strong>Title:</strong><br>
            CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19636v1">http://arxiv.org/abs/2604.19636v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Apr 2026 21:24:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/90a873da/8b3b365c.mp3" length="20480492" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma</p>

            <p><strong>Title:</strong><br>
            CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19636v1">http://arxiv.org/abs/2604.19636v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AgentSPEX: An Agent SPecification and EXecution Language</title>
      <itunes:episode>1793</itunes:episode>
      <podcast:episode>1793</podcast:episode>
      <itunes:title>AgentSPEX: An Agent SPecification and EXecution Language</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fd3c57b7-b5de-4ab8-8f3d-cd85c96a25a2</guid>
      <link>https://share.transistor.fm/s/8f1d0330</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            AgentSPEX: An Agent SPecification and EXecution Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13346v1">http://arxiv.org/abs/2604.13346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            AgentSPEX: An Agent SPecification and EXecution Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13346v1">http://arxiv.org/abs/2604.13346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Apr 2026 21:24:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f1d0330/4eb60041.mp3" length="21794917" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            AgentSPEX: An Agent SPecification and EXecution Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13346v1">http://arxiv.org/abs/2604.13346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</title>
      <itunes:episode>1792</itunes:episode>
      <podcast:episode>1792</podcast:episode>
      <itunes:title>AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0f9905e1-8649-4503-b020-7fe7ae6b561d</guid>
      <link>https://share.transistor.fm/s/11d6729e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19747v1">http://arxiv.org/abs/2604.19747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19747v1">http://arxiv.org/abs/2604.19747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Apr 2026 21:24:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/11d6729e/61f516b3.mp3" length="23596334" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1471</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19747v1">http://arxiv.org/abs/2604.19747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond a better generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TEMPO: Scaling Test-time Training for Large Reasoning Models</title>
      <itunes:episode>1791</itunes:episode>
      <podcast:episode>1791</podcast:episode>
      <itunes:title>TEMPO: Scaling Test-time Training for Large Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba88204e-9cda-447f-b635-913e96b5b4ed</guid>
      <link>https://share.transistor.fm/s/248cd987</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang</p>

            <p><strong>Title:</strong><br>
            TEMPO: Scaling Test-time Training for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19295v1">http://arxiv.org/abs/2604.19295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang</p>

            <p><strong>Title:</strong><br>
            TEMPO: Scaling Test-time Training for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19295v1">http://arxiv.org/abs/2604.19295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Apr 2026 21:23:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/248cd987/4a26d495.mp3" length="22624152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1410</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang</p>

            <p><strong>Title:</strong><br>
            TEMPO: Scaling Test-time Training for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.19295v1">http://arxiv.org/abs/2604.19295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation</title>
      <itunes:episode>1790</itunes:episode>
      <podcast:episode>1790</podcast:episode>
      <itunes:title>Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8be49279-76ff-4960-aa77-ff49346d1987</guid>
      <link>https://share.transistor.fm/s/ef876ed5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang</p>

            <p><strong>Title:</strong><br>
            Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18168v1">http://arxiv.org/abs/2604.18168v1</a></p>

            <p><strong>Abstract:</strong><br>
            Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang</p>

            <p><strong>Title:</strong><br>
            Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18168v1">http://arxiv.org/abs/2604.18168v1</a></p>

            <p><strong>Abstract:</strong><br>
            Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:02:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ef876ed5/918721ee.mp3" length="20135663" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang</p>

            <p><strong>Title:</strong><br>
            Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18168v1">http://arxiv.org/abs/2604.18168v1</a></p>

            <p><strong>Abstract:</strong><br>
            Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation. However, an intuitive yet unexplored direction is to extend the condition from fixed class labels to flexible text inputs, enabling richer content creation. Compared to the limited class labels, text conditions pose greater challenges to the model's understanding capability, necessitating the effective integration of powerful text encoders into the MeanFlow framework. Surprisingly, although incorporating text conditions appears straightforward, we find that integrating powerful LLM-based text encoders using conventional training strategies results in unsatisfactory performance. To uncover the underlying cause, we conduct detailed analyses and reveal that, due to the extremely limited number of refinement steps in the MeanFlow generation, such as only one step, the text feature representations are required to possess sufficiently high discriminability. This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Guided by these insights, we leverage a powerful LLM-based text encoder validated to possess the required semantic properties and adapt the MeanFlow generation process to this framework, resulting in efficient text-conditioned synthesis for the first time. Furthermore, we validate our approach on the widely used diffusion model, demonstrating significant generation performance improvements. We hope this work provides a general and practical reference for future research on text-conditioned MeanFlow generation. The code is available at https://github.com/AMAP-ML/EMF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</title>
      <itunes:episode>1789</itunes:episode>
      <podcast:episode>1789</podcast:episode>
      <itunes:title>OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8fdc23a-8a35-42f2-a8f8-48e80c41a186</guid>
      <link>https://share.transistor.fm/s/056e5f85</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen</p>

            <p><strong>Title:</strong><br>
            OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18486v1">http://arxiv.org/abs/2604.18486v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen</p>

            <p><strong>Title:</strong><br>
            OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18486v1">http://arxiv.org/abs/2604.18486v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:02:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/056e5f85/c9497567.mp3" length="25521881" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1591</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyang Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen</p>

            <p><strong>Title:</strong><br>
            OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18486v1">http://arxiv.org/abs/2604.18486v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. At inference, the auxiliary decoders are discarded and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering state-of-the-art accuracy at answer-only latency, and providing direct evidence that tighter compression, when guided in both language and world-model supervision, produces more generalizable representations than verbose token-by-token reasoning. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence</title>
      <itunes:episode>1788</itunes:episode>
      <podcast:episode>1788</podcast:episode>
      <itunes:title>Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">268895e3-4a8a-4b5d-960a-f77aac423a65</guid>
      <link>https://share.transistor.fm/s/464a373b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, Jiajie Jin, Yutao Zhu, Hanbin Wang, Fangyu Lei, Qinyu Luo, Mingyang Chen, Zehui Chen, Jiazhan Feng, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18292v1">http://arxiv.org/abs/2604.18292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present <strong>Agent-World</strong>, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, Jiajie Jin, Yutao Zhu, Hanbin Wang, Fangyu Lei, Qinyu Luo, Mingyang Chen, Zehui Chen, Jiazhan Feng, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18292v1">http://arxiv.org/abs/2604.18292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present <strong>Agent-World</strong>, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:01:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/464a373b/5bec6f33.mp3" length="23096479" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1440</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, Jiajie Jin, Yutao Zhu, Hanbin Wang, Fangyu Lei, Qinyu Luo, Mingyang Chen, Zehui Chen, Jiazhan Feng, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18292v1">http://arxiv.org/abs/2604.18292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present <strong>Agent-World</strong>, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenGame: Open Agentic Coding for Games</title>
      <itunes:episode>1787</itunes:episode>
      <podcast:episode>1787</podcast:episode>
      <itunes:title>OpenGame: Open Agentic Coding for Games</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">92e84868-20fa-4524-966a-cdbabc89663e</guid>
      <link>https://share.transistor.fm/s/a693b42e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma, Kaituo Feng, Jiaming Han, Tianshuo Peng, Kaixuan Fan, Manyuan Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            OpenGame: Open Agentic Coding for Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18394v1">http://arxiv.org/abs/2604.18394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes - together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma, Kaituo Feng, Jiaming Han, Tianshuo Peng, Kaixuan Fan, Manyuan Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            OpenGame: Open Agentic Coding for Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18394v1">http://arxiv.org/abs/2604.18394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes - together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:01:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a693b42e/6a49dd6b.mp3" length="24550505" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1531</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma, Kaituo Feng, Jiaming Han, Tianshuo Peng, Kaixuan Fan, Manyuan Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            OpenGame: Open Agentic Coding for Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18394v1">http://arxiv.org/abs/2604.18394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes - together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MultiWorld: Scalable Multi-Agent Multi-View Video World Models</title>
      <itunes:episode>1786</itunes:episode>
      <podcast:episode>1786</podcast:episode>
      <itunes:title>MultiWorld: Scalable Multi-Agent Multi-View Video World Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e2bb1400-45d8-4569-a01c-acd80b0fbc64</guid>
      <link>https://share.transistor.fm/s/c5ee1a11</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MultiWorld: Scalable Multi-Agent Multi-View Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18564v2">http://arxiv.org/abs/2604.18564v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present <strong>MultiWorld</strong>, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MultiWorld: Scalable Multi-Agent Multi-View Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18564v2">http://arxiv.org/abs/2604.18564v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present <strong>MultiWorld</strong>, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:00:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5ee1a11/c340a37e.mp3" length="20872487" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MultiWorld: Scalable Multi-Agent Multi-View Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.18564v2">http://arxiv.org/abs/2604.18564v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present <strong>MultiWorld</strong>, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EasyVideoR1: Easier RL for Video Understanding</title>
      <itunes:episode>1785</itunes:episode>
      <podcast:episode>1785</podcast:episode>
      <itunes:title>EasyVideoR1: Easier RL for Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2982ec6a-bd0e-46fa-8b54-12546ee38253</guid>
      <link>https://share.transistor.fm/s/2aa3fafc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            EasyVideoR1: Easier RL for Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16893v1">http://arxiv.org/abs/2604.16893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present <strong>EasyVideoR1</strong>, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            EasyVideoR1: Easier RL for Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16893v1">http://arxiv.org/abs/2604.16893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present <strong>EasyVideoR1</strong>, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Apr 2026 21:00:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2aa3fafc/7cb59fe5.mp3" length="26051403" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1625</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            EasyVideoR1: Easier RL for Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16893v1">http://arxiv.org/abs/2604.16893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present <strong>EasyVideoR1</strong>, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Elucidating the SNR-t Bias of Diffusion Probabilistic Models</title>
      <itunes:episode>1784</itunes:episode>
      <podcast:episode>1784</podcast:episode>
      <itunes:title>Elucidating the SNR-t Bias of Diffusion Probabilistic Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4bcf1051-2977-4b65-ab48-f9c8f1cc7af4</guid>
      <link>https://share.transistor.fm/s/0bca3399</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan</p>

            <p><strong>Title:</strong><br>
            Elucidating the SNR-t Bias of Diffusion Probabilistic Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16044v1">http://arxiv.org/abs/2604.16044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at https://github.com/AMAP-ML/DCW.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan</p>

            <p><strong>Title:</strong><br>
            Elucidating the SNR-t Bias of Diffusion Probabilistic Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16044v1">http://arxiv.org/abs/2604.16044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at https://github.com/AMAP-ML/DCW.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Apr 2026 20:55:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0bca3399/a322f427.mp3" length="21284175" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan</p>

            <p><strong>Title:</strong><br>
            Elucidating the SNR-t Bias of Diffusion Probabilistic Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.16044v1">http://arxiv.org/abs/2604.16044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at https://github.com/AMAP-ML/DCW.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips</title>
      <itunes:episode>1783</itunes:episode>
      <podcast:episode>1783</podcast:episode>
      <itunes:title>Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d2590d88-722f-464e-a940-032dbbde6875</guid>
      <link>https://share.transistor.fm/s/85134e78</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ido Galil, Moshe Kimhi, Ran El-Yaniv</p>

            <p><strong>Title:</strong><br>
            Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07408v2">http://arxiv.org/abs/2502.07408v2</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of parameter bits. We introduce Deep Neural Lesion (DNL), a data-free and optimization-free method that locates critical parameters, and an enhanced single-pass variant, 1P-DNL, that refines this selection with one forward and backward pass on random inputs. We show that this vulnerability spans multiple domains, including image classification, object detection, instance segmentation, and reasoning large language models. In image classification, flipping just two sign bits in ResNet-50 on ImageNet reduces accuracy by 99.8%. In object detection and instance segmentation, one or two sign flips in the backbone collapse COCO detection and mask AP for Mask R-CNN and YOLOv8-seg models. In language modeling, two sign flips in different experts reduce Qwen3-30B-A3B-Thinking from 78% to 0% accuracy. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ido Galil, Moshe Kimhi, Ran El-Yaniv</p>

            <p><strong>Title:</strong><br>
            Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07408v2">http://arxiv.org/abs/2502.07408v2</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of parameter bits. We introduce Deep Neural Lesion (DNL), a data-free and optimization-free method that locates critical parameters, and an enhanced single-pass variant, 1P-DNL, that refines this selection with one forward and backward pass on random inputs. We show that this vulnerability spans multiple domains, including image classification, object detection, instance segmentation, and reasoning large language models. In image classification, flipping just two sign bits in ResNet-50 on ImageNet reduces accuracy by 99.8%. In object detection and instance segmentation, one or two sign flips in the backbone collapse COCO detection and mask AP for Mask R-CNN and YOLOv8-seg models. In language modeling, two sign flips in different experts reduce Qwen3-30B-A3B-Thinking from 78% to 0% accuracy. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Apr 2026 20:54:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/85134e78/ae4ec409.mp3" length="21628191" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ido Galil, Moshe Kimhi, Ran El-Yaniv</p>

            <p><strong>Title:</strong><br>
            Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07408v2">http://arxiv.org/abs/2502.07408v2</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of parameter bits. We introduce Deep Neural Lesion (DNL), a data-free and optimization-free method that locates critical parameters, and an enhanced single-pass variant, 1P-DNL, that refines this selection with one forward and backward pass on random inputs. We show that this vulnerability spans multiple domains, including image classification, object detection, instance segmentation, and reasoning large language models. In image classification, flipping just two sign bits in ResNet-50 on ImageNet reduces accuracy by 99.8%. In object detection and instance segmentation, one or two sign flips in the backbone collapse COCO detection and mask AP for Mask R-CNN and YOLOv8-seg models. In language modeling, two sign flips in different experts reduce Qwen3-30B-A3B-Thinking from 78% to 0% accuracy. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PersonaVLM: Long-Term Personalized Multimodal LLMs</title>
      <itunes:episode>1782</itunes:episode>
      <podcast:episode>1782</podcast:episode>
      <itunes:title>PersonaVLM: Long-Term Personalized Multimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d6a0dd9-e182-448e-adbd-cf384cb4bea7</guid>
      <link>https://share.transistor.fm/s/d1b0e087</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan</p>

            <p><strong>Title:</strong><br>
            PersonaVLM: Long-Term Personalized Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13074v1">http://arxiv.org/abs/2604.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time. In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan</p>

            <p><strong>Title:</strong><br>
            PersonaVLM: Long-Term Personalized Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13074v1">http://arxiv.org/abs/2604.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time. In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Apr 2026 20:54:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1b0e087/49b6de4e.mp3" length="24004243" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1497</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan</p>

            <p><strong>Title:</strong><br>
            PersonaVLM: Long-Term Personalized Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13074v1">http://arxiv.org/abs/2604.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users' evolving preferences and personality over time. In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user's evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method's effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3.5-Omni Technical Report</title>
      <itunes:episode>1781</itunes:episode>
      <podcast:episode>1781</podcast:episode>
      <itunes:title>Qwen3.5-Omni Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f25f0d93-1f5d-4474-94c1-aca9e15b2d8c</guid>
      <link>https://share.transistor.fm/s/bc9b6933</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qwen Team</p>

            <p><strong>Title:</strong><br>
            Qwen3.5-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15804v1">http://arxiv.org/abs/2604.15804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qwen Team</p>

            <p><strong>Title:</strong><br>
            Qwen3.5-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15804v1">http://arxiv.org/abs/2604.15804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Apr 2026 20:54:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bc9b6933/8153b3f8.mp3" length="24016343" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1497</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qwen Team</p>

            <p><strong>Title:</strong><br>
            Qwen3.5-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15804v1">http://arxiv.org/abs/2604.15804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds</title>
      <itunes:episode>1780</itunes:episode>
      <podcast:episode>1780</podcast:episode>
      <itunes:title>HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1ca276b3-7ef5-46af-85ea-e1bcf494c505</guid>
      <link>https://share.transistor.fm/s/17628d7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14268v1">http://arxiv.org/abs/2604.14268v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Furthermore, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14268v1">http://arxiv.org/abs/2604.14268v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Furthermore, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Apr 2026 20:36:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/17628d7a/52b40174.mp3" length="23198882" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14268v1">http://arxiv.org/abs/2604.14268v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining its model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Furthermore, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic image-based lighting (IBL), efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</title>
      <itunes:episode>1779</itunes:episode>
      <podcast:episode>1779</podcast:episode>
      <itunes:title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b6a662d7-ce10-40bc-bd2e-7ece945eb64b</guid>
      <link>https://share.transistor.fm/s/71777fbf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15308v1">http://arxiv.org/abs/2604.15308v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15308v1">http://arxiv.org/abs/2604.15308v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Apr 2026 20:35:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/71777fbf/ef235dcf.mp3" length="21361931" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1331</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.15308v1">http://arxiv.org/abs/2604.15308v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation</title>
      <itunes:episode>1778</itunes:episode>
      <podcast:episode>1778</podcast:episode>
      <itunes:title>DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e22533e9-fce8-47a2-96b4-d3417fcd419d</guid>
      <link>https://share.transistor.fm/s/6ee8b7b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14683v1">http://arxiv.org/abs/2604.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14683v1">http://arxiv.org/abs/2604.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Apr 2026 20:35:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6ee8b7b8/52012b22.mp3" length="23412437" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14683v1">http://arxiv.org/abs/2604.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR$^{3}$-Agent based on multiple state-of-the-art language models demonstrate that DR$^{3}$-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seedance 2.0: Advancing Video Generation for World Complexity</title>
      <itunes:episode>1777</itunes:episode>
      <podcast:episode>1777</podcast:episode>
      <itunes:title>Seedance 2.0: Advancing Video Generation for World Complexity</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4d91b635-23fc-4c95-b918-e979697a3a36</guid>
      <link>https://share.transistor.fm/s/fb89ffc7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 2.0: Advancing Video Generation for World Complexity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14148v1">http://arxiv.org/abs/2604.14148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for joint multi-modal audio-video generation. It supports four input modalities (text, image, audio, and video) and integrates one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation, and in both expert evaluations and public user tests it has demonstrated performance on par with the leading models in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, at native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast, an accelerated variant designed to boost generation speed for low-latency scenarios. These improvements to its foundational generation capabilities and multi-modal generation performance bring an enhanced creative experience for end users.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 2.0: Advancing Video Generation for World Complexity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14148v1">http://arxiv.org/abs/2604.14148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for joint multi-modal audio-video generation. It supports four input modalities (text, image, audio, and video) and integrates one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation, and in both expert evaluations and public user tests it has demonstrated performance on par with the leading models in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, at native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast, an accelerated variant designed to boost generation speed for low-latency scenarios. These improvements to its foundational generation capabilities and multi-modal generation performance bring an enhanced creative experience for end users.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:23:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb89ffc7/91fe8922.mp3" length="26617753" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1660</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 2.0: Advancing Video Generation for World Complexity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14148v1">http://arxiv.org/abs/2604.14148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for joint multi-modal audio-video generation. It supports four input modalities (text, image, audio, and video) and integrates one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation, and in both expert evaluations and public user tests it has demonstrated performance on par with the leading models in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, at native output resolutions of 480p and 720p. For multi-modal reference inputs, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast, an accelerated variant designed to boost generation speed for low-latency scenarios. These improvements to its foundational generation capabilities and multi-modal generation performance bring an enhanced creative experience for end users.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</title>
      <itunes:episode>1776</itunes:episode>
      <podcast:episode>1776</podcast:episode>
      <itunes:title>GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d437bc19-a156-44e9-99c1-c41ad3665a49</guid>
      <link>https://share.transistor.fm/s/36b03e7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07429v1">http://arxiv.org/abs/2604.07429v1</a></p>

            <p><strong>Abstract:</strong><br>
            On the path toward embodied generalists for real-world interaction, Multimodal Large Language Model (MLLM) agents still face challenges of latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07429v1">http://arxiv.org/abs/2604.07429v1</a></p>

            <p><strong>Abstract:</strong><br>
            On the path toward embodied generalists for real-world interaction, Multimodal Large Language Model (MLLM) agents still face challenges of latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:23:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/36b03e7a/f4988019.mp3" length="25011558" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1560</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07429v1">http://arxiv.org/abs/2604.07429v1</a></p>

            <p><strong>Abstract:</strong><br>
            On the path toward embodied generalists for real-world interaction, Multimodal Large Language Model (MLLM) agents still face challenges of latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</title>
      <itunes:episode>1775</itunes:episode>
      <podcast:episode>1775</podcast:episode>
      <itunes:title>RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">28a127e9-a372-4de9-b7f3-ac3a8331d05f</guid>
      <link>https://share.transistor.fm/s/b3383ad5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11626v2">http://arxiv.org/abs/2604.11626v2</a></p>

            <p><strong>Abstract:</strong><br>
            Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11626v2">http://arxiv.org/abs/2604.11626v2</a></p>

            <p><strong>Abstract:</strong><br>
            Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:22:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b3383ad5/b0c31c15.mp3" length="23312138" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1453</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11626v2">http://arxiv.org/abs/2604.11626v2</a></p>

            <p><strong>Abstract:</strong><br>
            Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</title>
      <itunes:episode>1774</itunes:episode>
      <podcast:episode>1774</podcast:episode>
      <itunes:title>SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58030f47-504b-45b5-afc4-7782b91573bf</guid>
      <link>https://share.transistor.fm/s/83b1dda7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14144v1">http://arxiv.org/abs/2604.14144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14144v1">http://arxiv.org/abs/2604.14144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:22:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83b1dda7/1243960b.mp3" length="23274105" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1451</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14144v1">http://arxiv.org/abs/2604.14144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation</title>
      <itunes:episode>1773</itunes:episode>
      <podcast:episode>1773</podcast:episode>
      <itunes:title>OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1ec0557c-8923-4a99-ac3a-51ddf8e591fa</guid>
      <link>https://share.transistor.fm/s/6a6b3ee4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho</p>

            <p><strong>Title:</strong><br>
            OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10866v2">http://arxiv.org/abs/2604.10866v2</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LES-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho</p>

            <p><strong>Title:</strong><br>
            OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10866v2">http://arxiv.org/abs/2604.10866v2</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LES-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:22:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a6b3ee4/fa122796.mp3" length="26451444" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1650</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan Liu, Yantao Liu, Dayiheng Liu, Tsung-Yi Ho</p>

            <p><strong>Title:</strong><br>
            OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10866v2">http://arxiv.org/abs/2604.10866v2</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents are expected to perform professional work across hundreds of occupational domains (from emergency department triage to nuclear reactor safety monitoring to customs import processing), yet existing benchmarks can only evaluate agents in the few domains where public environments exist. We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments through LLM-driven tool response generation. Our multi-agent synthesis pipeline automatically produces evaluation instances with guaranteed solvability, calibrated difficulty, and document-grounded diversity. OccuBench evaluates agents along two complementary dimensions: task completion across professional domains and environmental robustness under controlled fault injection (explicit errors, implicit data degradation, and mixed faults). We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than both explicit errors (timeouts, 500s) and mixed faults, because they lack overt error signals and require the agent to independently detect data degradation; (3) larger models, newer generations, and higher reasoning effort consistently improve performance, with GPT-5.2 improving by 27.5 points from minimal to maximum reasoning effort; and (4) strong agents are not necessarily strong environment simulators, and simulator quality is critical for LES-based evaluation reliability. OccuBench provides the first systematic cross-industry evaluation of AI agents on professional occupational tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</title>
      <itunes:episode>1772</itunes:episode>
      <podcast:episode>1772</podcast:episode>
      <itunes:title>Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bd477682-ee96-4782-9916-ebef15985d40</guid>
      <link>https://share.transistor.fm/s/4bde1b83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14004v1">http://arxiv.org/abs/2604.14004v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate <strong>Memory Transfer Learning</strong> (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14004v1">http://arxiv.org/abs/2604.14004v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate <strong>Memory Transfer Learning</strong> (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:21:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4bde1b83/900fab30.mp3" length="25452507" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1587</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14004v1">http://arxiv.org/abs/2604.14004v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate <strong>Memory Transfer Learning</strong> (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space</title>
      <itunes:episode>1771</itunes:episode>
      <podcast:episode>1771</podcast:episode>
      <itunes:title>From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a74cc97-5f22-433e-8080-2fc0cd564b93</guid>
      <link>https://share.transistor.fm/s/4959b0cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14142v1">http://arxiv.org/abs/2604.14142v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14142v1">http://arxiv.org/abs/2604.14142v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:21:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4959b0cb/532d35eb.mp3" length="22632949" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1411</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.14142v1">http://arxiv.org/abs/2604.14142v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploration and Exploitation Errors Are Measurable for Language Model Agents</title>
      <itunes:episode>1770</itunes:episode>
      <podcast:episode>1770</podcast:episode>
      <itunes:title>Exploration and Exploitation Errors Are Measurable for Language Model Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">973e1223-c233-4321-9b9a-3bc6772f85c8</guid>
      <link>https://share.transistor.fm/s/72fa15f0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee</p>

            <p><strong>Title:</strong><br>
            Exploration and Exploitation Errors Are Measurable for Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13151v1">http://arxiv.org/abs/2604.13151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code <a href="https://github.com/jjj-madison/measurable-explore-exploit">here</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee</p>

            <p><strong>Title:</strong><br>
            Exploration and Exploitation Errors Are Measurable for Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13151v1">http://arxiv.org/abs/2604.13151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code <a href="https://github.com/jjj-madison/measurable-explore-exploit">here</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 16 Apr 2026 21:21:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/72fa15f0/71354f13.mp3" length="21520337" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1341</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee</p>

            <p><strong>Title:</strong><br>
            Exploration and Exploitation Errors Are Measurable for Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13151v1">http://arxiv.org/abs/2604.13151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code <a href="https://github.com/jjj-madison/measurable-explore-exploit">here</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</title>
      <itunes:episode>1769</itunes:episode>
      <podcast:episode>1769</podcast:episode>
      <itunes:title>ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8a8daad-5408-42f4-bb9d-428e9e124ccb</guid>
      <link>https://share.transistor.fm/s/b7b500a4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 123 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11784v1">http://arxiv.org/abs/2604.11784v1</a></p>

            <p><strong>Abstract:</strong><br>
            GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present <strong>ClawGUI</strong>, an open-source framework addressing these three gaps within a single harness. <strong>ClawGUI-RL</strong> provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. <strong>ClawGUI-Eval</strong> enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. <strong>ClawGUI-Agent</strong> brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, <strong>ClawGUI-2B</strong> achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 123 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11784v1">http://arxiv.org/abs/2604.11784v1</a></p>

            <p><strong>Abstract:</strong><br>
            GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present <strong>ClawGUI</strong>, an open-source framework addressing these three gaps within a single harness. <strong>ClawGUI-RL</strong> provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. <strong>ClawGUI-Eval</strong> enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. <strong>ClawGUI-Agent</strong> brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, <strong>ClawGUI-2B</strong> achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:17:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b7b500a4/62d9afbf.mp3" length="23077238" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1439</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 123 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11784v1">http://arxiv.org/abs/2604.11784v1</a></p>

            <p><strong>Abstract:</strong><br>
            GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present <strong>ClawGUI</strong>, an open-source framework addressing these three gaps within a single harness. <strong>ClawGUI-RL</strong> provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. <strong>ClawGUI-Eval</strong> enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8% reproduction against official baselines. <strong>ClawGUI-Agent</strong> brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, <strong>ClawGUI-2B</strong> achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</title>
      <itunes:episode>1768</itunes:episode>
      <podcast:episode>1768</podcast:episode>
      <itunes:title>Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">46433841-b853-4594-8f3a-7a68117ea2ee</guid>
      <link>https://share.transistor.fm/s/fa05df48</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13016v2">http://arxiv.org/abs/2604.13016v2</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, namely a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13016v2">http://arxiv.org/abs/2604.13016v2</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, namely a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:17:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa05df48/ae287572.mp3" length="24172308" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1507</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13016v2">http://arxiv.org/abs/2604.13016v2</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, namely a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization</title>
      <itunes:episode>1767</itunes:episode>
      <podcast:episode>1767</podcast:episode>
      <itunes:title>Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">50b0acf8-9043-4e92-96e0-e66451a16eef</guid>
      <link>https://share.transistor.fm/s/5f29310c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin</p>

            <p><strong>Title:</strong><br>
            Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09574v1">http://arxiv.org/abs/2604.09574v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics and show through our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin</p>

            <p><strong>Title:</strong><br>
            Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09574v1">http://arxiv.org/abs/2604.09574v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics and show through our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:16:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f29310c/a3fef7fe.mp3" length="20376375" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin</p>

            <p><strong>Title:</strong><br>
            Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09574v1">http://arxiv.org/abs/2604.09574v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics and show through our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks</title>
      <itunes:episode>1766</itunes:episode>
      <podcast:episode>1766</podcast:episode>
      <itunes:title>SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">90e1799e-c9f9-4a7c-a55f-a2ff99fd2241</guid>
      <link>https://share.transistor.fm/s/bfd49d32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen</p>

            <p><strong>Title:</strong><br>
            SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08865v1">http://arxiv.org/abs/2604.08865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen</p>

            <p><strong>Title:</strong><br>
            SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08865v1">http://arxiv.org/abs/2604.08865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:16:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bfd49d32/37f7e0d2.mp3" length="21701713" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Yang Liu, Guanhua Chen</p>

            <p><strong>Title:</strong><br>
            SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08865v1">http://arxiv.org/abs/2604.08865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Toward Autonomous Long-Horizon Engineering for ML Research</title>
      <itunes:episode>1765</itunes:episode>
      <podcast:episode>1765</podcast:episode>
      <itunes:title>Toward Autonomous Long-Horizon Engineering for ML Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26005ed3-6f7d-4a7d-9389-dc1cb8335efb</guid>
      <link>https://share.transistor.fm/s/10df9cb1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia</p>

            <p><strong>Title:</strong><br>
            Toward Autonomous Long-Horizon Engineering for ML Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13018v1">http://arxiv.org/abs/2604.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia</p>

            <p><strong>Title:</strong><br>
            Toward Autonomous Long-Horizon Engineering for ML Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13018v1">http://arxiv.org/abs/2604.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:15:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/10df9cb1/465a914c.mp3" length="23174602" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia</p>

            <p><strong>Title:</strong><br>
            Toward Autonomous Long-Horizon Engineering for ML Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.13018v1">http://arxiv.org/abs/2604.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation</title>
      <itunes:episode>1764</itunes:episode>
      <podcast:episode>1764</podcast:episode>
      <itunes:title>BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8b3a9f25-451e-4d01-899f-c34c9f866c21</guid>
      <link>https://share.transistor.fm/s/26f7f2b7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09497v1">http://arxiv.org/abs/2604.09497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09497v1">http://arxiv.org/abs/2604.09497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Apr 2026 23:15:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26f7f2b7/a0328af2.mp3" length="21149215" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09497v1">http://arxiv.org/abs/2604.09497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation</title>
      <itunes:episode>1763</itunes:episode>
      <podcast:episode>1763</podcast:episode>
      <itunes:title>QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5b151ec-e261-492c-92cd-19ad6ffc4b1e</guid>
      <link>https://share.transistor.fm/s/45b50ded</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI, cs.PL, cs.SE, quant-ph</p>

            <p><strong>Authors:</strong><br>
            Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08570v1">http://arxiv.org/abs/2604.08570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI, cs.PL, cs.SE, quant-ph</p>

            <p><strong>Authors:</strong><br>
            Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08570v1">http://arxiv.org/abs/2604.08570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:19:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/45b50ded/2ca7b91b.mp3" length="23961228" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI, cs.PL, cs.SE, quant-ph</p>

            <p><strong>Authors:</strong><br>
            Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08570v1">http://arxiv.org/abs/2604.08570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</title>
      <itunes:episode>1762</itunes:episode>
      <podcast:episode>1762</podcast:episode>
      <itunes:title>The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d045f53e-44a2-494d-b5e7-d9109bddcb7d</guid>
      <link>https://share.transistor.fm/s/23743c5b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11297v1">http://arxiv.org/abs/2604.11297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11297v1">http://arxiv.org/abs/2604.11297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:19:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/23743c5b/e98650a1.mp3" length="20709063" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1291</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11297v1">http://arxiv.org/abs/2604.11297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</title>
      <itunes:episode>1761</itunes:episode>
      <podcast:episode>1761</podcast:episode>
      <itunes:title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f6c54977-833f-465f-b43f-079fd255c8ca</guid>
      <link>https://share.transistor.fm/s/1503c8af</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11804v1">http://arxiv.org/abs/2604.11804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11804v1">http://arxiv.org/abs/2604.11804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:19:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1503c8af/70e0bee6.mp3" length="21154633" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11804v1">http://arxiv.org/abs/2604.11804v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation</title>
      <itunes:episode>1760</itunes:episode>
      <podcast:episode>1760</podcast:episode>
      <itunes:title>Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0f32d848-6141-4db2-82b9-8a0a20628cb0</guid>
      <link>https://share.transistor.fm/s/4dfe48ad</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10098v1">http://arxiv.org/abs/2604.10098v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10098v1">http://arxiv.org/abs/2604.10098v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:18:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4dfe48ad/71fbfce6.mp3" length="20770112" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1294</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, Keyu Fan, Weihao Ye, Jing Xiong, Hui Shen, Chaofan Tao, Taiqiang Wu, Zhongwei Wan, Yulei Qian, Yuchen Xie, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10098v1">http://arxiv.org/abs/2604.10098v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the foundational architecture of modern machine learning, Transformers have driven remarkable progress across diverse AI domains. Despite their transformative impact, a persistent challenge across various Transformers is Attention Sink (AS), in which a disproportionate amount of attention is focused on a small subset of specific yet uninformative tokens. AS complicates interpretability, significantly affects the training and inference dynamics, and exacerbates issues such as hallucinations. In recent years, substantial research has been dedicated to understanding and harnessing AS. However, a comprehensive survey that systematically consolidates AS-related research and offers guidance for future advancements remains lacking. To address this gap, we present the first survey on AS, structured around three key dimensions that define the current research landscape: Fundamental Utilization, Mechanistic Interpretation, and Strategic Mitigation. Our work provides a pivotal contribution by clarifying key concepts and guiding researchers through the evolution and trends of the field. We envision this survey as a definitive resource, empowering researchers and practitioners to effectively manage AS within the current Transformer paradigm, while simultaneously inspiring innovative advancements for the next generation of Transformers. The paper list of this work is available at https://github.com/ZunhaiSu/Awesome-Attention-Sink.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Strips as Tokens: Artist Mesh Generation with Native UV Segmentation</title>
      <itunes:episode>1759</itunes:episode>
      <podcast:episode>1759</podcast:episode>
      <itunes:title>Strips as Tokens: Artist Mesh Generation with Native UV Segmentation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8d4e8b47-ca07-411a-9cc2-593dfbc47e45</guid>
      <link>https://share.transistor.fm/s/a1466812</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.CG, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rui Xu, Dafei Qin, Kaichun Qiao, Qiujie Dong, Huaijin Pi, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu, Wenping Wang, Taku Komura</p>

            <p><strong>Title:</strong><br>
            Strips as Tokens: Artist Mesh Generation with Native UV Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09132v1">http://arxiv.org/abs/2604.09132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.CG, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rui Xu, Dafei Qin, Kaichun Qiao, Qiujie Dong, Huaijin Pi, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu, Wenping Wang, Taku Komura</p>

            <p><strong>Title:</strong><br>
            Strips as Tokens: Artist Mesh Generation with Native UV Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09132v1">http://arxiv.org/abs/2604.09132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:18:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a1466812/4052a982.mp3" length="20740418" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1293</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.CG, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rui Xu, Dafei Qin, Kaichun Qiao, Qiujie Dong, Huaijin Pi, Qixuan Zhang, Longwen Zhang, Lan Xu, Jingyi Yu, Wenping Wang, Taku Komura</p>

            <p><strong>Title:</strong><br>
            Strips as Tokens: Artist Mesh Generation with Native UV Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.09132v1">http://arxiv.org/abs/2604.09132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in autoregressive transformers have demonstrated remarkable potential for generating artist-quality meshes. However, the token ordering strategies employed by existing methods typically fail to meet professional artist standards, where coordinate-based sorting yields inefficiently long sequences, and patch-based heuristics disrupt the continuous edge flow and structural regularity essential for high-quality modeling. To address these limitations, we propose Strips as Tokens (SATO), a novel framework with a token ordering strategy inspired by triangle strips. By constructing the sequence as a connected chain of faces that explicitly encodes UV boundaries, our method naturally preserves the organized edge flow and semantic layout characteristic of artist-created meshes. A key advantage of this formulation is its unified representation, enabling the same token sequence to be decoded into either a triangle or quadrilateral mesh. This flexibility facilitates joint training on both data types: large-scale triangle data provides fundamental structural priors, while high-quality quad data enhances the geometric regularity of the outputs. Extensive experiments demonstrate that SATO consistently outperforms prior methods in terms of geometric quality, structural coherence, and UV segmentation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator</title>
      <itunes:episode>1758</itunes:episode>
      <podcast:episode>1758</podcast:episode>
      <itunes:title>Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">891eae62-be73-4970-ac79-512fb67d4220</guid>
      <link>https://share.transistor.fm/s/b5bdd726</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li</p>

            <p><strong>Title:</strong><br>
            Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08121v1">http://arxiv.org/abs/2604.08121v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li</p>

            <p><strong>Title:</strong><br>
            Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08121v1">http://arxiv.org/abs/2604.08121v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:17:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b5bdd726/60d6d38d.mp3" length="21881058" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li</p>

            <p><strong>Title:</strong><br>
            Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08121v1">http://arxiv.org/abs/2604.08121v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</title>
      <itunes:episode>1757</itunes:episode>
      <podcast:episode>1757</podcast:episode>
      <itunes:title>Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1ab6b73b-9f1e-47b3-82ff-b128786ce78a</guid>
      <link>https://share.transistor.fm/s/bf39287c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songlin Yang, Xianghao Kong, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10949v1">http://arxiv.org/abs/2604.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songlin Yang, Xianghao Kong, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10949v1">http://arxiv.org/abs/2604.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:17:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bf39287c/903da249.mp3" length="22213341" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1385</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songlin Yang, Xianghao Kong, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.10949v1">http://arxiv.org/abs/2604.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CocoaBench: Evaluating Unified Digital Agents in the Wild</title>
      <itunes:episode>1756</itunes:episode>
      <podcast:episode>1756</podcast:episode>
      <itunes:title>CocoaBench: Evaluating Unified Digital Agents in the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7a56d2aa-ef72-4f7b-bd83-2900586ee9f6</guid>
      <link>https://share.transistor.fm/s/40957b08</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            CocoaBench: Evaluating Unified Digital Agents in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11201v2">http://arxiv.org/abs/2604.11201v2</a></p>

            <p><strong>Abstract:</strong><br>
            LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            CocoaBench: Evaluating Unified Digital Agents in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11201v2">http://arxiv.org/abs/2604.11201v2</a></p>

            <p><strong>Abstract:</strong><br>
            LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:17:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40957b08/966e6d22.mp3" length="22062829" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1375</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            CocoaBench: Evaluating Unified Digital Agents in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11201v2">http://arxiv.org/abs/2604.11201v2</a></p>

            <p><strong>Abstract:</strong><br>
            LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CodeTracer: Towards Traceable Agent States</title>
      <itunes:episode>1755</itunes:episode>
      <podcast:episode>1755</podcast:episode>
      <itunes:title>CodeTracer: Towards Traceable Agent States</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58ebe3db-565a-4132-9543-88f15c2e56f5</guid>
      <link>https://share.transistor.fm/s/d8754d7f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            CodeTracer: Towards Traceable Agent States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11641v2">http://arxiv.org/abs/2604.11641v2</a></p>

            <p><strong>Abstract:</strong><br>
            Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            CodeTracer: Towards Traceable Agent States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11641v2">http://arxiv.org/abs/2604.11641v2</a></p>

            <p><strong>Abstract:</strong><br>
            Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Apr 2026 21:16:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8754d7f/7dac48a5.mp3" length="22722772" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1416</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            CodeTracer: Towards Traceable Agent States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.11641v2">http://arxiv.org/abs/2604.11641v2</a></p>

            <p><strong>Abstract:</strong><br>
            Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. Our code and data are publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WildDet3D: Scaling Promptable 3D Detection in the Wild</title>
      <itunes:episode>1754</itunes:episode>
      <podcast:episode>1754</podcast:episode>
      <itunes:title>WildDet3D: Scaling Promptable 3D Detection in the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3356dec-dd53-4286-b58e-48c6420e6bcc</guid>
      <link>https://share.transistor.fm/s/b66d91a1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 216 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            WildDet3D: Scaling Promptable 3D Detection in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08626v1">http://arxiv.org/abs/2604.08626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 216 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            WildDet3D: Scaling Promptable 3D Detection in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08626v1">http://arxiv.org/abs/2604.08626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 20:53:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b66d91a1/d3c6a8ee.mp3" length="24069866" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1501</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 216 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikai Huang, Jieyu Zhang, Sijun Li, Taoyang Jia, Jiafei Duan, Yunqian Cheng, Jaemin Cho, Mattew Wallingford, Rustin Soraki, Chris Dongjoo Kim, Donovan Clay, Taira Anderson, Winson Han, Ali Farhadi, Bharath Hariharan, Zhongzheng Ren, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            WildDet3D: Scaling Promptable 3D Detection in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08626v1">http://arxiv.org/abs/2604.08626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios</title>
      <itunes:episode>1753</itunes:episode>
      <podcast:episode>1753</podcast:episode>
      <itunes:title>FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7a00c3ca-f318-4daf-a91a-d3aba0580567</guid>
      <link>https://share.transistor.fm/s/dc9b8d81</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07413v2">http://arxiv.org/abs/2604.07413v2</a></p>

            <p><strong>Abstract:</strong><br>
            The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07413v2">http://arxiv.org/abs/2604.07413v2</a></p>

            <p><strong>Abstract:</strong><br>
            The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 20:53:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dc9b8d81/28c20cb0.mp3" length="21129121" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangru Jian, Hao Xu, Wei Pang, Xinjian Zhao, Chengyu Tao, Qixin Zhang, Xikun Zhang, Chao Zhang, Guanzhi Deng, Alex Xue, Juan Du, Tianshu Yu, Garth Tarr, Linqi Song, Qiuzhuang Sun, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07413v2">http://arxiv.org/abs/2604.07413v2</a></p>

            <p><strong>Abstract:</strong><br>
            The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EXAONE 4.5 Technical Report</title>
      <itunes:episode>1752</itunes:episode>
      <podcast:episode>1752</podcast:episode>
      <itunes:title>EXAONE 4.5 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e109f98c-29f6-406e-89e6-3595c3ba303f</guid>
      <link>https://share.transistor.fm/s/43aef649</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Ahra Jo, Hyunjik Jo, Yeonsik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Changhun Lee, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Kwangrok Ryoo, Minju Seo, Sejong Yang, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Kyubeen Han, Joonwon Jang, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Jiyeon Jung, Daeseong Kim, Dohoon Kim, Dohyun Kim, Hyunseo Kim, Minu Kim, Myoungshin Kim, Youchul Kim, Byungoh Ko, Christopher Lee, Edward Hwayoung Lee, Honglak Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Woohyung Lim, Jueun Mun, Jaewoo Park, Jimin Park, Jinho Park, Yongmin Park, Wooseok Seo, Yongwoo Song, Sihyuk Yi, Kyungjae Yoo, Sangyeon Yoon</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08644v1">http://arxiv.org/abs/2604.08644v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Ahra Jo, Hyunjik Jo, Yeonsik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Changhun Lee, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Kwangrok Ryoo, Minju Seo, Sejong Yang, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Kyubeen Han, Joonwon Jang, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Jiyeon Jung, Daeseong Kim, Dohoon Kim, Dohyun Kim, Hyunseo Kim, Minu Kim, Myoungshin Kim, Youchul Kim, Byungoh Ko, Christopher Lee, Edward Hwayoung Lee, Honglak Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Woohyung Lim, Jueun Mun, Jaewoo Park, Jimin Park, Jinho Park, Yongmin Park, Wooseok Seo, Yongwoo Song, Sihyuk Yi, Kyungjae Yoo, Sangyeon Yoon</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08644v1">http://arxiv.org/abs/2604.08644v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 20:52:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/43aef649/fb9fbd31.mp3" length="22387554" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Eunbi Choi, Kibong Choi, Sehyun Chun, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Ahra Jo, Hyunjik Jo, Yeonsik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Changhun Lee, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Kwangrok Ryoo, Minju Seo, Sejong Yang, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Kyubeen Han, Joonwon Jang, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Jiyeon Jung, Daeseong Kim, Dohoon Kim, Dohyun Kim, Hyunseo Kim, Minu Kim, Myoungshin Kim, Youchul Kim, Byungoh Ko, Christopher Lee, Edward Hwayoung Lee, Honglak Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Woohyung Lim, Jueun Mun, Jaewoo Park, Jimin Park, Jinho Park, Yongmin Park, Wooseok Seo, Yongwoo Song, Sihyuk Yi, Kyungjae Yoo, Sangyeon Yoon</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08644v1">http://arxiv.org/abs/2604.08644v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details</title>
      <itunes:episode>1751</itunes:episode>
      <podcast:episode>1751</podcast:episode>
      <itunes:title>RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">14c9c430-62d6-4616-b73e-94ea1dfc5547</guid>
      <link>https://share.transistor.fm/s/fdfc82b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dewei Zhou, You Li, Zongxin Yang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06870v1">http://arxiv.org/abs/2604.06870v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dewei Zhou, You Li, Zongxin Yang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06870v1">http://arxiv.org/abs/2604.06870v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 20:52:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fdfc82b2/26522c33.mp3" length="21534133" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dewei Zhou, You Li, Zongxin Yang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06870v1">http://arxiv.org/abs/2604.06870v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory</title>
      <itunes:episode>1750</itunes:episode>
      <podcast:episode>1750</podcast:episode>
      <itunes:title>Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4a242dd0-59a5-489e-8044-0997648873c2</guid>
      <link>https://share.transistor.fm/s/e5ee82ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08995v2">http://arxiv.org/abs/2604.08995v2</a></p>

            <p><strong>Abstract:</strong><br>
            With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08995v2">http://arxiv.org/abs/2604.08995v2</a></p>

            <p><strong>Abstract:</strong><br>
            With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Apr 2026 20:51:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e5ee82ee/325dfbd1.mp3" length="23042140" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zile Wang, Zexiang Liu, Jiaxing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08995v2">http://arxiv.org/abs/2604.08995v2</a></p>

            <p><strong>Abstract:</strong><br>
            With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time longform video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability</title>
      <itunes:episode>1749</itunes:episode>
      <podcast:episode>1749</podcast:episode>
      <itunes:title>Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9a6cebf-c6af-4958-be09-24a698ab594a</guid>
      <link>https://share.transistor.fm/s/f47acd51</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06628v1">http://arxiv.org/abs/2604.06628v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06628v1">http://arxiv.org/abs/2604.06628v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Apr 2026 20:35:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f47acd51/aad399ce.mp3" length="23809952" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1484</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06628v1">http://arxiv.org/abs/2604.06628v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SkillClaw: Let Skills Evolve Collectively with Agentic Evolver</title>
      <itunes:episode>1748</itunes:episode>
      <podcast:episode>1748</podcast:episode>
      <itunes:title>SkillClaw: Let Skills Evolve Collectively with Agentic Evolver</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5313aa76-ae14-48b5-a08a-379497771032</guid>
      <link>https://share.transistor.fm/s/77e82e71</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            SkillClaw: Let Skills Evolve Collectively with Agentic Evolver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08377v1">http://arxiv.org/abs/2604.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            SkillClaw: Let Skills Evolve Collectively with Agentic Evolver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08377v1">http://arxiv.org/abs/2604.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Apr 2026 20:34:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77e82e71/fbedaba1.mp3" length="21703390" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            SkillClaw: Let Skills Evolve Collectively with Agentic Evolver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.08377v1">http://arxiv.org/abs/2604.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience. While interactions from different users provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates. To address these issues, we present SkillClaw, a framework for collective skill evolution in multi-user agent ecosystems, which treats cross-user and over-time interactions as the primary signal for improving skills. SkillClaw continuously aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities. The resulting skills are maintained in a shared repository and synchronized across users, allowing improvements discovered in one context to propagate system-wide while requiring no additional effort from users. By integrating multi-user experience into ongoing skill updates, SkillClaw enables cross-user knowledge transfer and cumulative capability improvement, and experiments on WildClawBench show that, with limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RAGEN-2: Reasoning Collapse in Agentic RL</title>
      <itunes:episode>1747</itunes:episode>
      <podcast:episode>1747</podcast:episode>
      <itunes:title>RAGEN-2: Reasoning Collapse in Agentic RL</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c97e033-03b9-458d-94eb-5f50d4366d65</guid>
      <link>https://share.transistor.fm/s/5a6ca1f9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li</p>

            <p><strong>Title:</strong><br>
            RAGEN-2: Reasoning Collapse in Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06268v1">http://arxiv.org/abs/2604.06268v1</a></p>

            <p><strong>Abstract:</strong><br>
            RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li</p>

            <p><strong>Title:</strong><br>
            RAGEN-2: Reasoning Collapse in Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06268v1">http://arxiv.org/abs/2604.06268v1</a></p>

            <p><strong>Abstract:</strong><br>
            RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Apr 2026 20:42:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5a6ca1f9/1405c6fe.mp3" length="24641204" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1536</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li</p>

            <p><strong>Title:</strong><br>
            RAGEN-2: Reasoning Collapse in Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06268v1">http://arxiv.org/abs/2604.06268v1</a></p>

            <p><strong>Abstract:</strong><br>
            RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MARS: Enabling Autoregressive Models Multi-Token Generation</title>
      <itunes:episode>1746</itunes:episode>
      <podcast:episode>1746</podcast:episode>
      <itunes:title>MARS: Enabling Autoregressive Models Multi-Token Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">07499f7e-64ce-4525-b64b-ff2276c2b869</guid>
      <link>https://share.transistor.fm/s/c9fba3c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun</p>

            <p><strong>Title:</strong><br>
            MARS: Enabling Autoregressive Models Multi-Token Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07023v1">http://arxiv.org/abs/2604.07023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun</p>

            <p><strong>Title:</strong><br>
            MARS: Enabling Autoregressive Models Multi-Token Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07023v1">http://arxiv.org/abs/2604.07023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Apr 2026 20:42:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c9fba3c1/baaf2352.mp3" length="22508376" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun</p>

            <p><strong>Title:</strong><br>
            MARS: Enabling Autoregressive Models Multi-Token Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.07023v1">http://arxiv.org/abs/2604.07023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Combee: Scaling Prompt Learning for Self-Improving Language Model Agents</title>
      <itunes:episode>1745</itunes:episode>
      <podcast:episode>1745</podcast:episode>
      <itunes:title>Combee: Scaling Prompt Learning for Self-Improving Language Model Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f4ac6fb3-58b5-42d3-b911-074b9f5bb81b</guid>
      <link>https://share.transistor.fm/s/4dad7136</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            Combee: Scaling Prompt Learning for Self-Improving Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04247v1">http://arxiv.org/abs/2604.04247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            Combee: Scaling Prompt Learning for Self-Improving Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04247v1">http://arxiv.org/abs/2604.04247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Apr 2026 20:41:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4dad7136/d7e7c6b1.mp3" length="21122018" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            Combee: Scaling Prompt Learning for Self-Improving Language Model Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04247v1">http://arxiv.org/abs/2604.04247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods (like ACE or GEPA) can learn system prompts to improve accuracy based on previous agent runs. However, these methods primarily focus on single-agent or low-parallelism settings. This fundamentally limits their ability to efficiently learn from a large set of collected agentic traces. It would be efficient and beneficial to run prompt learning in parallel to accommodate the growing trend of learning from many agentic traces or parallel agent executions. Yet without a principled strategy for scaling, current methods suffer from quality degradation with high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a novel framework to scale parallel prompt learning for self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and employs an augmented shuffle mechanism; Combee also introduces a dynamic batch size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER demonstrate that Combee achieves up to 17x speedup over previous methods with comparable or better accuracy and equivalent cost.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding</title>
      <itunes:episode>1744</itunes:episode>
      <podcast:episode>1744</podcast:episode>
      <itunes:title>Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9b0301a6-1985-4211-966d-74904a4bdbd8</guid>
      <link>https://share.transistor.fm/s/d1291c48</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 201 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05015v1">http://arxiv.org/abs/2604.05015v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a <strong>progressive tri-level hierarchy</strong> that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a <strong>group-based non-linear evaluation</strong> strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by <strong>3,300 human-hours</strong> and up to <strong>5 rounds</strong> of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 201 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05015v1">http://arxiv.org/abs/2604.05015v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a <strong>progressive tri-level hierarchy</strong> that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a <strong>group-based non-linear evaluation</strong> strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by <strong>3,300 human-hours</strong> and up to <strong>5 rounds</strong> of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:28:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1291c48/5c7dab38.mp3" length="23962067" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 201 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05015v1">http://arxiv.org/abs/2604.05015v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a <strong>progressive tri-level hierarchy</strong> that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a <strong>group-based non-linear evaluation</strong> strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by <strong>3,300 human-hours</strong> and up to <strong>5 rounds</strong> of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</title>
      <itunes:episode>1743</itunes:episode>
      <podcast:episode>1743</podcast:episode>
      <itunes:title>Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">650d98ce-bc59-466f-b8df-1e6636193f6c</guid>
      <link>https://share.transistor.fm/s/d89a064f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang</p>

            <p><strong>Title:</strong><br>
            Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06132v1">http://arxiv.org/abs/2604.06132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang</p>

            <p><strong>Title:</strong><br>
            Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06132v1">http://arxiv.org/abs/2604.06132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:28:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d89a064f/b5031a05.mp3" length="21799519" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang</p>

            <p><strong>Title:</strong><br>
            Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.06132v1">http://arxiv.org/abs/2604.06132v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning to Retrieve from Agent Trajectories</title>
      <itunes:episode>1742</itunes:episode>
      <podcast:episode>1742</podcast:episode>
      <itunes:title>Learning to Retrieve from Agent Trajectories</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2626d315-2c82-4543-b707-3f73e3bf63d4</guid>
      <link>https://share.transistor.fm/s/3b6a9f09</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhou, Sunhao Dai, Changle Qu, Liang Pang, Jun Xu, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Learning to Retrieve from Agent Trajectories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04949v1">http://arxiv.org/abs/2604.04949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhou, Sunhao Dai, Changle Qu, Liang Pang, Jun Xu, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Learning to Retrieve from Agent Trajectories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04949v1">http://arxiv.org/abs/2604.04949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:27:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3b6a9f09/4ce7ad3b.mp3" length="21606823" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhou, Sunhao Dai, Changle Qu, Liang Pang, Jun Xu, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Learning to Retrieve from Agent Trajectories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04949v1">http://arxiv.org/abs/2604.04949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of large language model (LLM) powered search agents, however, retrieval is increasingly consumed by agents rather than human beings, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions exhibit a fundamental mismatch with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, where supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation</title>
      <itunes:episode>1741</itunes:episode>
      <podcast:episode>1741</podcast:episode>
      <itunes:title>ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2699f9ef-fc65-41c7-a56d-c53eca5cd95a</guid>
      <link>https://share.transistor.fm/s/a593a2bb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li</p>

            <p><strong>Title:</strong><br>
            ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03922v1">http://arxiv.org/abs/2604.03922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a <em>circular dependency</em>. Our key insight is that we need not determine test correctness at all: <em>test votes should rank, not merely count</em>. What matters is not how many codes pass a test, but whether the test can <em>distinguish</em> correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose <strong>ACES</strong> (<strong>A</strong>UC <strong>C</strong>onsist<strong>E</strong>ncy <strong>S</strong>coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li</p>

            <p><strong>Title:</strong><br>
            ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03922v1">http://arxiv.org/abs/2604.03922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a <em>circular dependency</em>. Our key insight is that we need not determine test correctness at all: <em>test votes should rank, not merely count</em>. What matters is not how many codes pass a test, but whether the test can <em>distinguish</em> correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose <strong>ACES</strong> (<strong>A</strong>UC <strong>C</strong>onsist<strong>E</strong>ncy <strong>S</strong>coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:27:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a593a2bb/69211795.mp3" length="23448383" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1462</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li</p>

            <p><strong>Title:</strong><br>
            ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03922v1">http://arxiv.org/abs/2604.03922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a <em>circular dependency</em>. Our key insight is that we need not determine test correctness at all: <em>test votes should rank, not merely count</em>. What matters is not how many codes pass a test, but whether the test can <em>distinguish</em> correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose <strong>ACES</strong> (<strong>A</strong>UC <strong>C</strong>onsist<strong>E</strong>ncy <strong>S</strong>coring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers</title>
      <itunes:episode>1740</itunes:episode>
      <podcast:episode>1740</podcast:episode>
      <itunes:title>GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76d6db22-2fec-467e-b80f-048664658827</guid>
      <link>https://share.transistor.fm/s/17971d38</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shufan Jiang, Chios Chen, Zhiyang Chen</p>

            <p><strong>Title:</strong><br>
            GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02648v1">http://arxiv.org/abs/2604.02648v1</a></p>

            <p><strong>Abstract:</strong><br>
            The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shufan Jiang, Chios Chen, Zhiyang Chen</p>

            <p><strong>Title:</strong><br>
            GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02648v1">http://arxiv.org/abs/2604.02648v1</a></p>

            <p><strong>Abstract:</strong><br>
            The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:27:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/17971d38/023728e5.mp3" length="22946829" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shufan Jiang, Chios Chen, Zhiyang Chen</p>

            <p><strong>Title:</strong><br>
            GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02648v1">http://arxiv.org/abs/2604.02648v1</a></p>

            <p><strong>Abstract:</strong><br>
            The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning</title>
      <itunes:episode>1739</itunes:episode>
      <podcast:episode>1739</podcast:episode>
      <itunes:title>Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba0dc531-d6b4-4cb0-8944-545dd7c66cb7</guid>
      <link>https://share.transistor.fm/s/3a9a3ea0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.PF, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Qisheng Su, Shiting Huang, Zhen Fang, Ziyan Chen, Zehui Chen, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05404v1">http://arxiv.org/abs/2604.05404v1</a></p>

            <p><strong>Abstract:</strong><br>
            In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and thus becomes steadily slower as context length increases. However, existing efficiency metrics like token counts and tool-call counts fail to capture the real model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also discover that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve the quality of the answer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.PF, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Qisheng Su, Shiting Huang, Zhen Fang, Ziyan Chen, Zehui Chen, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05404v1">http://arxiv.org/abs/2604.05404v1</a></p>

            <p><strong>Abstract:</strong><br>
            In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and thus becomes steadily slower as context length increases. However, existing efficiency metrics like token counts and tool-call counts fail to capture the real model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also discover that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve the quality of the answer.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:26:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3a9a3ea0/0d0d1f73.mp3" length="20518908" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.PF, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Qisheng Su, Shiting Huang, Zhen Fang, Ziyan Chen, Zehui Chen, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05404v1">http://arxiv.org/abs/2604.05404v1</a></p>

            <p><strong>Abstract:</strong><br>
            In real-world Tool-Integrated Reasoning (TIR) scenarios, where LLMs interleave reasoning with external tool calls, a major source of inefficiency is that the tool calls create pauses between LLM requests and cause KV-Cache eviction, forcing recomputation. Also, the long, unfiltered responses returned by external tools inflate the KV-Cache, so each decode step spends more time loading the growing cache and thus becomes steadily slower as context length increases. However, existing efficiency metrics like token counts and tool-call counts fail to capture the real model inference latency. To address this, we introduce PTE (Prefill Token Equivalents), a hardware-aware TIR-efficiency metric that unifies internal reasoning and external tool-use costs while explicitly accounting for non-reusable KV-Cache and long-tool-response scenarios. Validation in a high-concurrency industrial setting indicates that PTE aligns significantly better with wall-clock latency than standard token counts, while maintaining consistent efficiency rankings across diverse hardware profiles. We conduct extensive experiments across five TIR benchmarks, quantify their PTE costs, and identify four inefficiency patterns that appear in TIR. We also discover that trajectories with higher PTE costs tend to have lower reasoning correctness, indicating that simply using more tools does not improve the quality of the answer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement</title>
      <itunes:episode>1738</itunes:episode>
      <podcast:episode>1738</podcast:episode>
      <itunes:title>ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">80c4ae04-a0e3-4eb8-8cb3-8b1f5ff29b79</guid>
      <link>https://share.transistor.fm/s/0c4433a8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson</p>

            <p><strong>Title:</strong><br>
            ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01591v2">http://arxiv.org/abs/2604.01591v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson</p>

            <p><strong>Title:</strong><br>
            ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01591v2">http://arxiv.org/abs/2604.01591v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:26:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c4433a8/88c0b0ce.mp3" length="21539574" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson</p>

            <p><strong>Title:</strong><br>
            ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01591v2">http://arxiv.org/abs/2604.01591v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision</title>
      <itunes:episode>1737</itunes:episode>
      <podcast:episode>1737</podcast:episode>
      <itunes:title>Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">68a431bc-f0db-403d-bf98-7d491e35a952</guid>
      <link>https://share.transistor.fm/s/1f6da4e9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo</p>

            <p><strong>Title:</strong><br>
            Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04934v1">http://arxiv.org/abs/2604.04934v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo</p>

            <p><strong>Title:</strong><br>
            Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04934v1">http://arxiv.org/abs/2604.04934v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:25:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1f6da4e9/8f925b97.mp3" length="23950777" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1493</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo</p>

            <p><strong>Title:</strong><br>
            Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04934v1">http://arxiv.org/abs/2604.04934v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from garment catalog images, capturing full upper and lower garment triplets to overcome the single-garment-posed video pair limitation, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers to stabilize training, preserve pretrained generative quality, and improve garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</title>
      <itunes:episode>1736</itunes:episode>
      <podcast:episode>1736</podcast:episode>
      <itunes:title>MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1251ae0b-b62d-4556-9f78-517cc8cea63b</guid>
      <link>https://share.transistor.fm/s/f59f6f34</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.DC, cs.OS</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye</p>

            <p><strong>Title:</strong><br>
            MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05091v1">http://arxiv.org/abs/2604.05091v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To mitigate the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.DC, cs.OS</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye</p>

            <p><strong>Title:</strong><br>
            MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05091v1">http://arxiv.org/abs/2604.05091v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To mitigate the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:25:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f59f6f34/033b845c.mp3" length="24258821" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1512</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.DC, cs.OS</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye</p>

            <p><strong>Title:</strong><br>
            MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05091v1">http://arxiv.org/abs/2604.05091v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To mitigate the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Watch Before You Answer: Learning from Visually Grounded Post-Training</title>
      <itunes:episode>1735</itunes:episode>
      <podcast:episode>1735</podcast:episode>
      <itunes:title>Watch Before You Answer: Learning from Visually Grounded Post-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a81c5274-27aa-4b27-80e5-e674d6840301</guid>
      <link>https://share.transistor.fm/s/68d8cbad</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen</p>

            <p><strong>Title:</strong><br>
            Watch Before You Answer: Learning from Visually Grounded Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05117v1">http://arxiv.org/abs/2604.05117v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen</p>

            <p><strong>Title:</strong><br>
            Watch Before You Answer: Learning from Visually Grounded Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05117v1">http://arxiv.org/abs/2604.05117v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Apr 2026 21:25:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/68d8cbad/73f9b78f.mp3" length="19808371" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen</p>

            <p><strong>Title:</strong><br>
            Watch Before You Answer: Learning from Visually Grounded Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.05117v1">http://arxiv.org/abs/2604.05117v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenWorldLib: A Unified Codebase and Definition of Advanced World Models</title>
      <itunes:episode>1734</itunes:episode>
      <podcast:episode>1734</podcast:episode>
      <itunes:title>OpenWorldLib: A Unified Codebase and Definition of Advanced World Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ca7adece-a8a2-49ce-a283-7fd1b31de8ce</guid>
      <link>https://share.transistor.fm/s/83740e83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 152 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            OpenWorldLib: A Unified Codebase and Definition of Advanced World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04707v1">http://arxiv.org/abs/2604.04707v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 152 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            OpenWorldLib: A Unified Codebase and Definition of Advanced World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04707v1">http://arxiv.org/abs/2604.04707v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:22:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83740e83/7aed3fad.mp3" length="23060513" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1438</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 152 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            DataFlow Team, Bohan Zeng, Daili Hua, Kaixin Zhu, Yifan Dai, Bozhou Li, Yuran Wang, Chengzhuo Tong, Yifan Yang, Mingkun Chang, Jianbin Zhao, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Junbo Niu, Zimo Meng, Tianyi Bai, Meiyi Qiang, Huanyao Zhang, Zhiyou Xiao, Tianyu Guo, Qinhan Yu, Runhao Zhao, Zhengpin Li, Xinyi Huang, Yisheng Pan, Yiwen Tang, Yang Shi, Yue Ding, Xinlong Chen, Hongcheng Gao, Minglei Shi, Jialong Wu, Zekun Wang, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Yiren Song, Mike Zheng Shou, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            OpenWorldLib: A Unified Codebase and Definition of Advanced World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04707v1">http://arxiv.org/abs/2604.04707v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models have garnered significant attention as a promising research direction in artificial intelligence, yet a clear and unified definition remains lacking. In this paper, we introduce OpenWorldLib, a comprehensive and standardized inference framework for Advanced World Models. Drawing on the evolution of world models, we propose a clear definition: a world model is a model or framework centered on perception, equipped with interaction and long-term memory capabilities, for understanding and predicting the complex world. We further systematically categorize the essential capabilities of world models. Based on this definition, OpenWorldLib integrates models across different tasks within a unified framework, enabling efficient reuse and collaborative inference. Finally, we present additional reflections and analyses on potential future directions for world model research. Code link: https://github.com/OpenDCAI/OpenWorldLib</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale</title>
      <itunes:episode>1733</itunes:episode>
      <podcast:episode>1733</podcast:episode>
      <itunes:title>MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">06dac90d-1606-4810-82b1-035724b373ce</guid>
      <link>https://share.transistor.fm/s/6c6d4b3b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04771v1">http://arxiv.org/abs/2604.04771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.</p>
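
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            As a rough illustration of the Cross-Model Consistency Verification idea, the sketch below scores a sample by the mean pairwise agreement of several parser outputs and routes low-agreement samples to further refinement. The threshold, helper names, and the string-similarity measure are our assumptions, not the paper's implementation.</p>

            <pre><code>
# Illustrative consistency scoring: several parsers process the same page,
# and low pairwise agreement flags the sample as hard.
from difflib import SequenceMatcher
from itertools import combinations

def agreement(outputs: list[str]) -> float:
    """Mean pairwise string similarity across model outputs (0..1)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def triage(outputs: list[str], easy_threshold: float = 0.95) -> str:
    score = agreement(outputs)
    if score >= easy_threshold:
        return "consensus"   # treat the agreed output as a reliable pseudo-label
    return "hard"            # route to an iterative judge-and-refine stage

print(triage(["| a | b |", "| a | b |", "| a | c |"]))
            </code></pre>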
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04771v1">http://arxiv.org/abs/2604.04771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:22:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c6d4b3b/f31015d6.mp3" length="22974834" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1432</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04771v1">http://arxiv.org/abs/2604.04771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present MinerU2.5-Pro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of MinerU2.5 completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench v1.6 protocol. Without any architectural modification, MinerU2.5-Pro achieves 95.69 on OmniDocBench v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200× more parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models</title>
      <itunes:episode>1732</itunes:episode>
      <podcast:episode>1732</podcast:episode>
      <itunes:title>LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0eae1ff5-03d9-4214-80f7-72baf03ba571</guid>
      <link>https://share.transistor.fm/s/ee333453</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung</p>

            <p><strong>Title:</strong><br>
            LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28301v1">http://arxiv.org/abs/2603.28301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para</p>
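
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract does not give the PRIDE formula. The sketch below only illustrates the general idea of combining a semantic factor and a syntactic factor into one paraphrase-difficulty score, using crude stand-ins (bag-of-words cosine and token-level edit distance) and arbitrary weights of our choosing.</p>

            <pre><code>
# Hedged sketch of a paraphrase-difficulty score in the spirit of PRIDE.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity as a crude semantic stand-in.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def edit_ratio(a: str, b: str) -> float:
    # Token-level Levenshtein distance, normalized, as a syntactic stand-in.
    ta, tb = a.lower().split(), b.lower().split()
    dp = [[i + j if i * j == 0 else 0 for j in range(len(tb) + 1)]
          for i in range(len(ta) + 1)]
    for i in range(1, len(ta) + 1):
        for j in range(1, len(tb) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (ta[i - 1] != tb[j - 1]))
    return dp[-1][-1] / max(len(ta), len(tb), 1)

def paraphrase_difficulty(original: str, paraphrase: str,
                          w_sem: float = 0.5, w_syn: float = 0.5) -> float:
    return (w_sem * (1 - cosine(original, paraphrase))
            + w_syn * edit_ratio(original, paraphrase))

print(paraphrase_difficulty("pick up the red mug", "grasp the crimson cup"))
            </code></pre>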
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung</p>

            <p><strong>Title:</strong><br>
            LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28301v1">http://arxiv.org/abs/2603.28301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:22:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ee333453/bcc49932.mp3" length="22014794" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung</p>

            <p><strong>Title:</strong><br>
            LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28301v1">http://arxiv.org/abs/2603.28301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</title>
      <itunes:episode>1731</itunes:episode>
      <podcast:episode>1731</podcast:episode>
      <itunes:title>TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27a8a89d-80df-4272-a225-47557c9ed8cc</guid>
      <link>https://share.transistor.fm/s/a4164116</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04921v1">http://arxiv.org/abs/2604.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.</p>
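
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The exact scoring rule is not stated in the abstract. The snippet below loosely sketches the described idea: each cached key is scored by a cosine series over its distance from the query, with amplitudes taken from assumed pre-RoPE query/key centers, multiplied by the key norm. All names, shapes, and coefficients here are our assumptions, not the paper's formula.</p>

            <pre><code>
# Rough sketch: position-based key scoring driven by a trigonometric series.
import numpy as np

def key_scores(q_center, k_center, key_norms, query_pos, key_pos, freqs):
    """Score each cached key for retention (larger = keep).

    q_center, k_center : (n_freq, 2) assumed pre-RoPE per-frequency centers
    key_norms          : (n_keys,)  L2 norms of cached keys
    key_pos            : (n_keys,)  absolute positions of cached keys
    freqs              : (n_freq,)  RoPE angular frequencies
    """
    d = query_pos - key_pos                              # relative distances
    amp = np.einsum("ij,ij->i", q_center, k_center)      # dot(q_i, k_i) per freq
    series = np.cos(np.outer(d, freqs)) @ amp            # sum_i amp_i*cos(w_i*d)
    return series * key_norms

rng = np.random.default_rng(0)
n_freq, n_keys = 8, 16
scores = key_scores(rng.normal(size=(n_freq, 2)) + 1.0,
                    rng.normal(size=(n_freq, 2)) + 1.0,
                    rng.uniform(1, 2, n_keys),
                    query_pos=100,
                    key_pos=np.arange(84, 100),
                    freqs=1.0 / (10000 ** (np.arange(n_freq) / n_freq)))
keep = np.argsort(scores)[-8:]   # retain the top-8 keys in the compressed cache
print(sorted(keep.tolist()))
            </code></pre>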
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04921v1">http://arxiv.org/abs/2604.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:21:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a4164116/b6df5928.mp3" length="20572402" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            TriAttention: Efficient Long Reasoning with Trigonometric KV Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04921v1">http://arxiv.org/abs/2604.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, making representative queries very few, leading to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention to estimate key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Adam's Law: Textual Frequency Law on Large Language Models</title>
      <itunes:episode>1730</itunes:episode>
      <podcast:episode>1730</podcast:episode>
      <itunes:title>Adam's Law: Textual Frequency Law on Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b345fe12-bde0-43a2-a388-a54511e40551</guid>
      <link>https://share.transistor.fm/s/edc1ced4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam</p>

            <p><strong>Title:</strong><br>
            Adam's Law: Textual Frequency Law on Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02176v2">http://arxiv.org/abs/2604.02176v2</a></p>

            <p><strong>Abstract:</strong><br>
            While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.</p>
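
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            As a minimal sketch of the curriculum idea (CTFT), the snippet below orders training sentences by an estimated sentence-level frequency before fine-tuning. The frequency estimate here is just an average of word counts over a reference corpus we supply, standing in for the paper's online-resource estimator; the function names are ours.</p>

            <pre><code>
# Minimal sketch of curriculum ordering by estimated sentence frequency.
from collections import Counter

def sentence_frequency(sentence: str, word_counts: Counter) -> float:
    words = sentence.lower().split()
    return sum(word_counts[w] for w in words) / max(len(words), 1)

def curriculum_order(train_sentences: list[str], reference_corpus: str) -> list[str]:
    counts = Counter(reference_corpus.lower().split())
    # Fine-tune in increasing order of sentence-level frequency, as described.
    return sorted(train_sentences, key=lambda s: sentence_frequency(s, counts))

corpus = "the cat sat on the mat the dog sat"
print(curriculum_order(["the cat sat", "quantum flux capacitor"], corpus))
            </code></pre>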
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam</p>

            <p><strong>Title:</strong><br>
            Adam's Law: Textual Frequency Law on Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02176v2">http://arxiv.org/abs/2604.02176v2</a></p>

            <p><strong>Abstract:</strong><br>
            While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:21:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/edc1ced4/9f0034fc.mp3" length="21613106" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam</p>

            <p><strong>Title:</strong><br>
            Adam's Law: Textual Frequency Law on Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02176v2">http://arxiv.org/abs/2604.02176v2</a></p>

            <p><strong>Abstract:</strong><br>
            While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AURA: Always-On Understanding and Real-Time Assistance via Video Streams</title>
      <itunes:episode>1729</itunes:episode>
      <podcast:episode>1729</podcast:episode>
      <itunes:title>AURA: Always-On Understanding and Real-Time Assistance via Video Streams</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">898687d1-7693-4fdc-98a8-e6742df2705b</guid>
      <link>https://share.transistor.fm/s/07aa8635</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            AURA: Always-On Understanding and Real-Time Assistance via Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04184v1">http://arxiv.org/abs/2604.04184v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            AURA: Always-On Understanding and Real-Time Assistance via Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04184v1">http://arxiv.org/abs/2604.04184v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:20:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07aa8635/0a1b15bb.mp3" length="22696471" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            AURA: Always-On Understanding and Real-Time Assistance via Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04184v1">http://arxiv.org/abs/2604.04184v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ClawArena: Benchmarking AI Agents in Evolving Information Environments</title>
      <itunes:episode>1728</itunes:episode>
      <podcast:episode>1728</podcast:episode>
      <itunes:title>ClawArena: Benchmarking AI Agents in Evolving Information Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ddfd741e-d5ec-4475-bab4-fa5bc5d5d7eb</guid>
      <link>https://share.transistor.fm/s/fa6f4623</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            ClawArena: Benchmarking AI Agents in Evolving Information Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04202v1">http://arxiv.org/abs/2604.04202v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            ClawArena: Benchmarking AI Agents in Evolving Information Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04202v1">http://arxiv.org/abs/2604.04202v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:20:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa6f4623/aa8aef9a.mp3" length="20658918" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            ClawArena: Benchmarking AI Agents in Evolving Information Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04202v1">http://arxiv.org/abs/2604.04202v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing</title>
      <itunes:episode>1727</itunes:episode>
      <podcast:episode>1727</podcast:episode>
      <itunes:title>SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f496fc6-4fc2-4265-b3e8-f53a86b89856</guid>
      <link>https://share.transistor.fm/s/a29c9ab3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04911v1">http://arxiv.org/abs/2604.04911v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04911v1">http://arxiv.org/abs/2604.04911v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:20:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a29c9ab3/bea50884.mp3" length="21725121" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.04911v1">http://arxiv.org/abs/2604.04911v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LightThinker++: From Reasoning Compression to Memory Management</title>
      <itunes:episode>1726</itunes:episode>
      <podcast:episode>1726</podcast:episode>
      <itunes:title>LightThinker++: From Reasoning Compression to Memory Management</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1defbb6a-8401-4c33-8d47-3ed9440c0be3</guid>
      <link>https://share.transistor.fm/s/854bf5d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightThinker++: From Reasoning Compression to Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03679v1">http://arxiv.org/abs/2604.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightThinker++: From Reasoning Compression to Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03679v1">http://arxiv.org/abs/2604.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Apr 2026 21:19:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/854bf5d7/443e8708.mp3" length="19333144" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1205</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuqi Zhu, Jintian Zhang, Zhenjie Wan, Yujie Luo, Shuofei Qiao, Zhengke Gui, Da Zheng, Lei Liang, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightThinker++: From Reasoning Compression to Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03679v1">http://arxiv.org/abs/2604.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning, yet their efficiency is limited by the surging cognitive overhead of long thought traces. In this paper, we propose LightThinker, a method that enables LLMs to dynamically compress intermediate thoughts into compact semantic representations. However, static compression often struggles with complex reasoning where the irreversible loss of intermediate details can lead to logical bottlenecks. To address this, we evolve the framework into LightThinker++, introducing Explicit Adaptive Memory Management. This paradigm shifts to behavioral-level management by incorporating explicit memory primitives, supported by a specialized trajectory synthesis pipeline to train purposeful memory scheduling. Extensive experiments demonstrate the framework's versatility across three dimensions. (1) LightThinker reduces peak token usage by 70% and inference time by 26% with minimal accuracy loss. (2) In standard reasoning, LightThinker++ slashes peak token usage by 69.9% while yielding a +2.42% accuracy gain under the same context budget for maximum performance. (3) Most notably, in long-horizon agentic tasks, it maintains a stable footprint beyond 80 rounds (a 60%-70% reduction), achieving an average performance gain of 14.8% across different complex scenarios. Overall, our work provides a scalable direction for sustaining deep LLM reasoning over extended horizons with minimal overhead.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Self-Distilled RLVR</title>
      <itunes:episode>1725</itunes:episode>
      <podcast:episode>1725</podcast:episode>
      <itunes:title>Self-Distilled RLVR</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">45da26cc-2c42-4f0b-9134-1bbbfba09f06</guid>
      <link>https://share.transistor.fm/s/a9c8912a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan</p>

            <p><strong>Title:</strong><br>
            Self-Distilled RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03128v1">http://arxiv.org/abs/2604.03128v1</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan</p>

            <p><strong>Title:</strong><br>
            Self-Distilled RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03128v1">http://arxiv.org/abs/2604.03128v1</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Apr 2026 20:41:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9c8912a/398a531b.mp3" length="21092290" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1315</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan</p>

            <p><strong>Title:</strong><br>
            Self-Distilled RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03128v1">http://arxiv.org/abs/2604.03128v1</a></p>

            <p><strong>Abstract:</strong><br>
            On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Simple Baseline for Streaming Video Understanding</title>
      <itunes:episode>1724</itunes:episode>
      <podcast:episode>1724</podcast:episode>
      <itunes:title>A Simple Baseline for Streaming Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0727ec6a-2c12-47d8-89a6-93ec11a1c18c</guid>
      <link>https://share.transistor.fm/s/f56171b0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            A Simple Baseline for Streaming Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02317v1">http://arxiv.org/abs/2604.02317v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            A Simple Baseline for Streaming Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02317v1">http://arxiv.org/abs/2604.02317v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Apr 2026 20:41:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f56171b0/199e7422.mp3" length="21013746" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1310</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            A Simple Baseline for Streaming Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02317v1">http://arxiv.org/abs/2604.02317v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Token Warping Helps MLLMs Look from Nearby Viewpoints</title>
      <itunes:episode>1723</itunes:episode>
      <podcast:episode>1723</podcast:episode>
      <itunes:title>Token Warping Helps MLLMs Look from Nearby Viewpoints</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5d60cf31-670d-4fe4-a565-af3d184354b9</guid>
      <link>https://share.transistor.fm/s/7273758d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung</p>

            <p><strong>Title:</strong><br>
            Token Warping Helps MLLMs Look from Nearby Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02870v1">http://arxiv.org/abs/2604.02870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung</p>

            <p><strong>Title:</strong><br>
            Token Warping Helps MLLMs Look from Nearby Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02870v1">http://arxiv.org/abs/2604.02870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Apr 2026 20:41:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7273758d/a941dc26.mp3" length="19334806" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1205</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung</p>

            <p><strong>Title:</strong><br>
            Token Warping Helps MLLMs Look from Nearby Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02870v1">http://arxiv.org/abs/2604.02870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?</title>
      <itunes:episode>1722</itunes:episode>
      <podcast:episode>1722</podcast:episode>
      <itunes:title>Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">294aa9e4-9087-4b98-87a3-da38d7249c67</guid>
      <link>https://share.transistor.fm/s/8efea549</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03016v1">http://arxiv.org/abs/2604.03016v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03016v1">http://arxiv.org/abs/2604.03016v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Apr 2026 20:40:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8efea549/d79945e3.mp3" length="21869753" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1363</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.03016v1">http://arxiv.org/abs/2604.03016v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models</title>
      <itunes:episode>1721</itunes:episode>
      <podcast:episode>1721</podcast:episode>
      <itunes:title>DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7da913b-f288-4b12-84a4-0145afe6853d</guid>
      <link>https://share.transistor.fm/s/bc4c77b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26164v1">http://arxiv.org/abs/2603.26164v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26164v1">http://arxiv.org/abs/2603.26164v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 21:01:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bc4c77b2/a7af2c94.mp3" length="26411726" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1647</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma, Rongyi Yu, Hengyi Feng, Shixuan Sun, Zimo Meng, Xiaochen Ma, Xuanlin Yang, Qifeng Cai, Ruichuan An, Bohan Zeng, Zhen Hao Wong, Chengyu Shen, Runming He, Zhaoyang Han, Yaowei Zheng, Fangcheng Fu, Conghui He, Bin Cui, Zhiyu Li, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26164v1">http://arxiv.org/abs/2603.26164v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook</title>
      <itunes:episode>1720</itunes:episode>
      <podcast:episode>1720</podcast:episode>
      <itunes:title>The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b07b6fab-5d98-43b7-bd99-bad2622d21c6</guid>
      <link>https://share.transistor.fm/s/07b06903</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02029v1">http://arxiv.org/abs/2604.02029v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02029v1">http://arxiv.org/abs/2604.02029v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 21:00:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07b06903/4d1dc565.mp3" length="22042782" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02029v1">http://arxiv.org/abs/2604.02029v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generative World Renderer</title>
      <itunes:episode>1719</itunes:episode>
      <podcast:episode>1719</podcast:episode>
      <itunes:title>Generative World Renderer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d3c8f2c9-4891-4467-9d53-231a2a2ca3ba</guid>
      <link>https://share.transistor.fm/s/31f54a7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Generative World Renderer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02329v1">http://arxiv.org/abs/2604.02329v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Generative World Renderer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02329v1">http://arxiv.org/abs/2604.02329v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 21:00:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/31f54a7a/eb3b83d0.mp3" length="21840443" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1361</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Generative World Renderer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02329v1">http://arxiv.org/abs/2604.02329v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization</title>
      <itunes:episode>1718</itunes:episode>
      <podcast:episode>1718</podcast:episode>
      <itunes:title>SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a47e543c-1ca4-472f-86c3-531553502f4b</guid>
      <link>https://share.transistor.fm/s/878b55cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02268v1">http://arxiv.org/abs/2604.02268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02268v1">http://arxiv.org/abs/2604.02268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 21:00:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/878b55cb/73cd239c.mp3" length="18750939" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1168</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen</p>

            <p><strong>Title:</strong><br>
            SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02268v1">http://arxiv.org/abs/2604.02268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Steerable Visual Representations</title>
      <itunes:episode>1717</itunes:episode>
      <podcast:episode>1717</podcast:episode>
      <itunes:title>Steerable Visual Representations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b35d0ce0-e57b-4124-8708-b582904bc290</guid>
      <link>https://share.transistor.fm/s/59977e24</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            Steerable Visual Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02327v1">http://arxiv.org/abs/2604.02327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            Steerable Visual Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02327v1">http://arxiv.org/abs/2604.02327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 20:59:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/59977e24/05e4f086.mp3" length="20889175" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            Steerable Visual Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.02327v1">http://arxiv.org/abs/2604.02327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EgoSim: Egocentric World Simulator for Embodied Interaction Generation</title>
      <itunes:episode>1716</itunes:episode>
      <podcast:episode>1716</podcast:episode>
      <itunes:title>EgoSim: Egocentric World Simulator for Embodied Interaction Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e56eadcf-593e-4aa0-95e7-3c494f79f0e5</guid>
      <link>https://share.transistor.fm/s/4a13d23c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu</p>

            <p><strong>Title:</strong><br>
            EgoSim: Egocentric World Simulator for Embodied Interaction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01001v1">http://arxiv.org/abs/2604.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu</p>

            <p><strong>Title:</strong><br>
            EgoSim: Egocentric World Simulator for Embodied Interaction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01001v1">http://arxiv.org/abs/2604.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 20:59:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4a13d23c/8f47d06b.mp3" length="23850872" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1487</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, Xudong Xu</p>

            <p><strong>Title:</strong><br>
            EgoSim: Egocentric World Simulator for Embodied Interaction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01001v1">http://arxiv.org/abs/2604.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery</title>
      <itunes:episode>1715</itunes:episode>
      <podcast:episode>1715</podcast:episode>
      <itunes:title>CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">346d3553-feee-4f17-a10a-914ec0fdc90a</guid>
      <link>https://share.transistor.fm/s/7e1fdb40</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01658v1">http://arxiv.org/abs/2604.01658v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01658v1">http://arxiv.org/abs/2604.01658v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Apr 2026 20:58:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e1fdb40/43645832.mp3" length="24005101" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1497</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01658v1">http://arxiv.org/abs/2604.01658v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers</title>
      <itunes:episode>1714</itunes:episode>
      <podcast:episode>1714</podcast:episode>
      <itunes:title>ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef993086-c863-4bf5-a3dc-cdde40ebc405</guid>
      <link>https://share.transistor.fm/s/46d53592</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 167 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qiwei Ye, Yiming Hei, Xi Zhang, Zhongyuan Wang</p>

            <p><strong>Title:</strong><br>
            ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24414v1">http://arxiv.org/abs/2603.24414v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system-level threats such as sensitive data leakage, privilege escalation, and malicious third-party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real-time security framework that integrates multi-dimensional protection mechanisms across three complementary architectural layers. (1) <strong>Skill-based protection</strong> operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) <strong>Plugin-based protection</strong> serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) <strong>Watcher-based protection</strong> introduces a novel, decoupled system-level security middleware that continuously verifies agent state evolution. It enables real-time execution intervention without coupling to the agent's internal logic, supporting operations such as halting high-risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 167 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qiwei Ye, Yiming Hei, Xi Zhang, Zhongyuan Wang</p>

            <p><strong>Title:</strong><br>
            ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24414v1">http://arxiv.org/abs/2603.24414v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system-level threats such as sensitive data leakage, privilege escalation, and malicious third-party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real-time security framework that integrates multi-dimensional protection mechanisms across three complementary architectural layers. (1) <strong>Skill-based protection</strong> operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) <strong>Plugin-based protection</strong> serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) <strong>Watcher-based protection</strong> introduces a novel, decoupled system-level security middleware that continuously verifies agent state evolution. It enables real-time execution intervention without coupling to the agent's internal logic, supporting operations such as halting high-risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:07:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46d53592/87ad3026.mp3" length="24224558" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1510</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 167 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou, Zejian Chen, Litian Zhang, Zheng Liu, Qiwei Ye, Yiming Hei, Xi Zhang, Zhongyuan Wang</p>

            <p><strong>Title:</strong><br>
            ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24414v1">http://arxiv.org/abs/2603.24414v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenClaw has rapidly established itself as a leading open-source autonomous agent runtime, offering powerful capabilities including tool integration, local file access, and shell command execution. However, these broad operational privileges introduce critical security vulnerabilities, transforming model errors into tangible system-level threats such as sensitive data leakage, privilege escalation, and malicious third-party skill execution. Existing security measures for the OpenClaw ecosystem remain highly fragmented, addressing only isolated stages of the agent lifecycle rather than providing holistic protection. To bridge this gap, we present ClawKeeper, a real-time security framework that integrates multi-dimensional protection mechanisms across three complementary architectural layers. (1) <strong>Skill-based protection</strong> operates at the instruction level, injecting structured security policies directly into the agent context to enforce environment-specific constraints and cross-platform boundaries. (2) <strong>Plugin-based protection</strong> serves as an internal runtime enforcer, providing configuration hardening, proactive threat detection, and continuous behavioral monitoring throughout the execution pipeline. (3) <strong>Watcher-based protection</strong> introduces a novel, decoupled system-level security middleware that continuously verifies agent state evolution. It enables real-time execution intervention without coupling to the agent's internal logic, supporting operations such as halting high-risk actions or enforcing human confirmation. We argue that this Watcher paradigm holds strong potential to serve as a foundational building block for securing next-generation autonomous agent systems. Extensive qualitative and quantitative evaluations demonstrate the effectiveness and robustness of ClawKeeper across diverse threat scenarios. We release our code.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Terminal Agents Suffice for Enterprise Automation</title>
      <itunes:episode>1713</itunes:episode>
      <podcast:episode>1713</podcast:episode>
      <itunes:title>Terminal Agents Suffice for Enterprise Automation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">48957c98-3439-411d-bd86-1bf6da685e3e</guid>
      <link>https://share.transistor.fm/s/d07ea5d9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            Terminal Agents Suffice for Enterprise Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.00073v1">http://arxiv.org/abs/2604.00073v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            Terminal Agents Suffice for Enterprise Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.00073v1">http://arxiv.org/abs/2604.00073v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:06:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d07ea5d9/79b341b2.mp3" length="23238541" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1449</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            Terminal Agents Suffice for Enterprise Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.00073v1">http://arxiv.org/abs/2604.00073v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome</title>
      <itunes:episode>1712</itunes:episode>
      <podcast:episode>1712</podcast:episode>
      <itunes:title>MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed85fc17-9e89-4336-b3f3-b1425763c270</guid>
      <link>https://share.transistor.fm/s/13a1098e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28407v1">http://arxiv.org/abs/2603.28407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28407v1">http://arxiv.org/abs/2603.28407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:06:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13a1098e/3169249b.mp3" length="23874703" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1488</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28407v1">http://arxiv.org/abs/2603.28407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?</title>
      <itunes:episode>1711</itunes:episode>
      <podcast:episode>1711</podcast:episode>
      <itunes:title>ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ac23bfc-068d-4e30-87f8-97d33329d704</guid>
      <link>https://share.transistor.fm/s/af04ad90</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li</p>

            <p><strong>Title:</strong><br>
            ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25823v1">http://arxiv.org/abs/2603.25823v1</a></p>

            <p><strong>Abstract:</strong><br>
            Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li</p>

            <p><strong>Title:</strong><br>
            ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25823v1">http://arxiv.org/abs/2603.25823v1</a></p>

            <p><strong>Abstract:</strong><br>
            Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:06:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/af04ad90/f140042a.mp3" length="23661130" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1475</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li</p>

            <p><strong>Title:</strong><br>
            ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25823v1">http://arxiv.org/abs/2603.25823v1</a></p>

            <p><strong>Abstract:</strong><br>
            Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</title>
      <itunes:episode>1710</itunes:episode>
      <podcast:episode>1710</podcast:episode>
      <itunes:title>Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bea637ca-b6f0-428a-9678-81a3d2b781d3</guid>
      <link>https://share.transistor.fm/s/44c7728f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26648v2">http://arxiv.org/abs/2603.26648v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26648v2">http://arxiv.org/abs/2603.26648v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:05:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/44c7728f/4abb7686.mp3" length="22574028" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26648v2">http://arxiv.org/abs/2603.26648v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QuitoBench: A High-Quality Open Time Series Forecasting Benchmark</title>
      <itunes:episode>1709</itunes:episode>
      <podcast:episode>1709</podcast:episode>
      <itunes:title>QuitoBench: A High-Quality Open Time Series Forecasting Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d57951e7-0096-4438-a70f-f92da51016c9</guid>
      <link>https://share.transistor.fm/s/3bc100c4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu</p>

            <p><strong>Title:</strong><br>
            QuitoBench: A High-Quality Open Time Series Forecasting Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26017v1">http://arxiv.org/abs/2603.26017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu</p>

            <p><strong>Title:</strong><br>
            QuitoBench: A High-Quality Open Time Series Forecasting Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26017v1">http://arxiv.org/abs/2603.26017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:05:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3bc100c4/e5427292.mp3" length="23828297" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1486</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu</p>

            <p><strong>Title:</strong><br>
            QuitoBench: A High-Quality Open Time Series Forecasting Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26017v1">http://arxiv.org/abs/2603.26017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting with coverage across eight trend × seasonality × forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon Quito, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context (L = 96) but foundation models dominate at long context (L ≥ 576); (ii) forecastability is the dominant difficulty driver, producing a 3.64× MAE gap across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning Shift: How Context Silently Shortens LLM Reasoning</title>
      <itunes:episode>1708</itunes:episode>
      <podcast:episode>1708</podcast:episode>
      <itunes:title>Reasoning Shift: How Context Silently Shortens LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9e3b143a-c53f-44a4-9b0c-9be09b211a39</guid>
      <link>https://share.transistor.fm/s/c6dd1870</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov</p>

            <p><strong>Title:</strong><br>
            Reasoning Shift: How Context Silently Shortens LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01161v1">http://arxiv.org/abs/2604.01161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (by up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov</p>

            <p><strong>Title:</strong><br>
            Reasoning Shift: How Context Silently Shortens LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01161v1">http://arxiv.org/abs/2604.01161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (by up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Apr 2026 21:05:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c6dd1870/824333dd.mp3" length="22568981" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov</p>

            <p><strong>Title:</strong><br>
            Reasoning Shift: How Context Silently Shortens LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2604.01161v1">http://arxiv.org/abs/2604.01161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (by up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization</title>
      <itunes:episode>1707</itunes:episode>
      <podcast:episode>1707</podcast:episode>
      <itunes:title>FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f95e3aff-32dc-4748-b234-cb48eb2fd934</guid>
      <link>https://share.transistor.fm/s/e5b6898d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 293 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19835v3">http://arxiv.org/abs/2603.19835v3</a></p>

            <p><strong>Abstract:</strong><br>
            We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 293 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19835v3">http://arxiv.org/abs/2603.19835v3</a></p>

            <p><strong>Abstract:</strong><br>
            We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:29:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e5b6898d/d5c32fd5.mp3" length="23018721" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1435</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 293 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19835v3">http://arxiv.org/abs/2603.19835v3</a></p>

            <p><strong>Abstract:</strong><br>
            We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence</title>
      <itunes:episode>1706</itunes:episode>
      <podcast:episode>1706</podcast:episode>
      <itunes:title>CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c1821972-5958-45bc-866f-8b94cfb45027</guid>
      <link>https://share.transistor.fm/s/bd20fe1d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 230 | cs.RO, cs.AI, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang</p>

            <p><strong>Title:</strong><br>
            CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28032v1">http://arxiv.org/abs/2603.28032v1</a></p>

            <p><strong>Abstract:</strong><br>
            The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency.   We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure.   Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 230 | cs.RO, cs.AI, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang</p>

            <p><strong>Title:</strong><br>
            CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28032v1">http://arxiv.org/abs/2603.28032v1</a></p>

            <p><strong>Abstract:</strong><br>
            The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency.   We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure.   Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:28:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd20fe1d/9026790b.mp3" length="27265217" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1700</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 230 | cs.RO, cs.AI, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang</p>

            <p><strong>Title:</strong><br>
            CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28032v1">http://arxiv.org/abs/2603.28032v1</a></p>

            <p><strong>Abstract:</strong><br>
            The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency.   We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure.   Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongCat-Next: Lexicalizing Modalities as Discrete Tokens</title>
      <itunes:episode>1705</itunes:episode>
      <podcast:episode>1705</podcast:episode>
      <itunes:title>LongCat-Next: Lexicalizing Modalities as Discrete Tokens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f7c0ab5-ce48-4730-843c-448569b84455</guid>
      <link>https://share.transistor.fm/s/ffaf5a2e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang</p>

            <p><strong>Title:</strong><br>
            LongCat-Next: Lexicalizing Modalities as Discrete Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27538v1">http://arxiv.org/abs/2603.27538v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang</p>

            <p><strong>Title:</strong><br>
            LongCat-Next: Lexicalizing Modalities as Discrete Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27538v1">http://arxiv.org/abs/2603.27538v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:28:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ffaf5a2e/6a5b678a.mp3" length="22544317" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1405</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan Bai, Yan Feng, Yanjie Li, Yao Qiu, Yerui Sun, Yifan Lu, Ying Luo, Yipeng Mei, Yitian Chen, Yuchen Xie, Yufang Liu, Yufei Chen, Yulei Qian, Yuqi Peng, Zhihang Yu, Zhixiong Han, Changran Wang, Chen Chen, Dian Zheng, Fengjiao Chen, Ge Yang, Haowei Guo, Haozhe Wang, Hongyu Li, Huicheng Jiang, Jiale Hong, Jialv Zou, Jiamu Li, Jianping Lin, Jiaxing Liu, Jie Yang, Jing Jin, Jun Kuang, Juncheng She, Kunming Luo, Kuofeng Gao, Lin Qiu, Linsen Guo, Mianqiu Huang, Qi Li, Qian Wang, Rumei Li, Siyu Ren, Wei Wang, Wenlong He, Xi Chen, Xiao Liu, Xiaoyu Li, Xu Huang, Xuanyu Zhu, Xuezhi Cao, Yaoming Zhu, Yifei Cao, Yimeng Jia, Yizhen Jiang, Yufei Gao, Zeyang Hu, Zhenlong Yuan, Zijian Zhang, Ziwen Wang</p>

            <p><strong>Title:</strong><br>
            LongCat-Next: Lexicalizing Modalities as Discrete Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27538v1">http://arxiv.org/abs/2603.27538v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells</title>
      <itunes:episode>1704</itunes:episode>
      <podcast:episode>1704</podcast:episode>
      <itunes:title>Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">da36e000-425a-437f-a91a-f22f7e2caec6</guid>
      <link>https://share.transistor.fm/s/621ebbf9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | q-bio.QM, cs.AI, q-bio.GN</p>

            <p><strong>Authors:</strong><br>
            Han Zhang, Guo-Hua Yuan, Chaohao Yuan, Tingyang Xu, Tian Bian, Hong Cheng, Wenbing Huang, Deli Zhao, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25240v1">http://arxiv.org/abs/2603.25240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | q-bio.QM, cs.AI, q-bio.GN</p>

            <p><strong>Authors:</strong><br>
            Han Zhang, Guo-Hua Yuan, Chaohao Yuan, Tingyang Xu, Tian Bian, Hong Cheng, Wenbing Huang, Deli Zhao, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25240v1">http://arxiv.org/abs/2603.25240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:28:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/621ebbf9/4ca4f541.mp3" length="22848213" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | q-bio.QM, cs.AI, q-bio.GN</p>

            <p><strong>Authors:</strong><br>
            Han Zhang, Guo-Hua Yuan, Chaohao Yuan, Tingyang Xu, Tian Bian, Hong Cheng, Wenbing Huang, Deli Zhao, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Lingshu-Cell: A generative cellular world model for transcriptome modeling toward virtual cells</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25240v1">http://arxiv.org/abs/2603.25240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modeling cellular states and predicting their responses to perturbations are central challenges in computational biology and the development of virtual cells. Existing foundation models for single-cell transcriptomics provide powerful static representations, but they do not explicitly model the distribution of cellular states for generative simulation. Here, we introduce Lingshu-Cell, a masked discrete diffusion model that learns transcriptomic state distributions and supports conditional simulation under perturbation. By operating directly in a discrete token space that is compatible with the sparse, non-sequential nature of single-cell transcriptomic data, Lingshu-Cell captures complex transcriptome-wide expression dependencies across approximately 18,000 genes without relying on prior gene selection, such as filtering by high variability or ranking by expression level. Across diverse tissues and species, Lingshu-Cell accurately reproduces transcriptomic distributions, marker-gene expression patterns and cell-subtype proportions, demonstrating its ability to capture complex cellular heterogeneity. Moreover, by jointly embedding cell type or donor identity with perturbation, Lingshu-Cell can predict whole-transcriptome expression changes for novel combinations of identity and perturbation. It achieves leading performance on the Virtual Cell Challenge H1 genetic perturbation benchmark and in predicting cytokine-induced responses in human PBMCs. Together, these results establish Lingshu-Cell as a flexible cellular world model for in silico simulation of cell states and perturbation responses, laying the foundation for a new paradigm in biological discovery and perturbation screening.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GEMS: Agent-Native Multimodal Generation with Memory and Skills</title>
      <itunes:episode>1703</itunes:episode>
      <podcast:episode>1703</podcast:episode>
      <itunes:title>GEMS: Agent-Native Multimodal Generation with Memory and Skills</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8fcfb53-51e3-4130-83b4-ff699db7e244</guid>
      <link>https://share.transistor.fm/s/d86642ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang</p>

            <p><strong>Title:</strong><br>
            GEMS: Agent-Native Multimodal Generation with Memory and Skills</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28088v1">http://arxiv.org/abs/2603.28088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang</p>

            <p><strong>Title:</strong><br>
            GEMS: Agent-Native Multimodal Generation with Memory and Skills</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28088v1">http://arxiv.org/abs/2603.28088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet they continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose <strong>GEMS</strong> (Agent-Native Multimodal <strong>GE</strong>neration with <strong>M</strong>emory and <strong>S</strong>kills), a framework that pushes beyond the inherent limitations of foundation models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:27:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d86642ff/b929acd1.mp3" length="21512383" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1341</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang</p>

            <p><strong>Title:</strong><br>
            GEMS: Agent-Native Multimodal Generation with Memory and Skills</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28088v1">http://arxiv.org/abs/2603.28088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet they continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose <strong>GEMS</strong> (Agent-Native Multimodal <strong>GE</strong>neration with <strong>M</strong>emory and <strong>S</strong>kills), a framework that pushes beyond the inherent limitations of foundation models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development</title>
      <itunes:episode>1702</itunes:episode>
      <podcast:episode>1702</podcast:episode>
      <itunes:title>Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9430f355-f9f5-4542-b934-e9d465ef8cc7</guid>
      <link>https://share.transistor.fm/s/d40dba81</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27460v1">http://arxiv.org/abs/2603.27460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the proliferation of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27460v1">http://arxiv.org/abs/2603.27460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the proliferation of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:27:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d40dba81/4d562954.mp3" length="23064309" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1438</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27460v1">http://arxiv.org/abs/2603.27460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the proliferation of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward</title>
      <itunes:episode>1701</itunes:episode>
      <podcast:episode>1701</podcast:episode>
      <itunes:title>VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5ceb2064-58d4-4376-a46b-a250e3276207</guid>
      <link>https://share.transistor.fm/s/4efcd656</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla</p>

            <p><strong>Title:</strong><br>
            VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26599v1">http://arxiv.org/abs/2603.26599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla</p>

            <p><strong>Title:</strong><br>
            VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26599v1">http://arxiv.org/abs/2603.26599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:27:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4efcd656/5ef914fa.mp3" length="22003493" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, Marta Tintore Gazulla</p>

            <p><strong>Title:</strong><br>
            VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26599v1">http://arxiv.org/abs/2603.26599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</title>
      <itunes:episode>1700</itunes:episode>
      <podcast:episode>1700</podcast:episode>
      <itunes:title>Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4ca2c6ba-52d1-4219-9593-558e86446bbb</guid>
      <link>https://share.transistor.fm/s/b9ad60a6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng</p>

            <p><strong>Title:</strong><br>
            Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29620v2">http://arxiv.org/abs/2603.29620v2</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng</p>

            <p><strong>Title:</strong><br>
            Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29620v2">http://arxiv.org/abs/2603.29620v2</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:26:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9ad60a6/8059ab37.mp3" length="22077057" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1376</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng</p>

            <p><strong>Title:</strong><br>
            Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29620v2">http://arxiv.org/abs/2603.29620v2</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CutClaw: Agentic Hours-Long Video Editing via Music Synchronization</title>
      <itunes:episode>1699</itunes:episode>
      <podcast:episode>1699</podcast:episode>
      <itunes:title>CutClaw: Agentic Hours-Long Video Editing via Music Synchronization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a9c09a03-42aa-4599-b38d-16e85fa46f23</guid>
      <link>https://share.transistor.fm/s/c744175f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun</p>

            <p><strong>Title:</strong><br>
            CutClaw: Agentic Hours-Long Video Editing via Music Synchronization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29664v1">http://arxiv.org/abs/2603.29664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Editing video content to align with audio has become a distinctive, human-crafted art form on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces visually appealing videos that follow the given instructions and are synchronized to music. Our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun</p>

            <p><strong>Title:</strong><br>
            CutClaw: Agentic Hours-Long Video Editing via Music Synchronization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29664v1">http://arxiv.org/abs/2603.29664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Editing video content to align with audio has become a distinctive, human-crafted art form on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces visually appealing videos that follow the given instructions and are synchronized to music. Our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:26:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c744175f/704cd952.mp3" length="21536211" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun</p>

            <p><strong>Title:</strong><br>
            CutClaw: Agentic Hours-Long Video Editing via Music Synchronization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.29664v1">http://arxiv.org/abs/2603.29664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Editing video content to align with audio has become a distinctive, human-crafted art form on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces visually appealing videos that follow the given instructions and are synchronized to music. Our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>daVinci-LLM: Towards the Science of Pretraining</title>
      <itunes:episode>1698</itunes:episode>
      <podcast:episode>1698</podcast:episode>
      <itunes:title>daVinci-LLM: Towards the Science of Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5dc36c7e-ef56-414c-b56a-46fc62ae67cf</guid>
      <link>https://share.transistor.fm/s/70f2ca89</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-LLM: Towards the Science of Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27164v1">http://arxiv.org/abs/2603.27164v1</a></p>

            <p><strong>Abstract:</strong><br>
            The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome the capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks a systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy spanning filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that processing depth systematically enhances capabilities, making it a critical dimension alongside volume scaling; that different domains exhibit distinct saturation dynamics, necessitating adaptive strategies ranging from proportion adjustments to format shifts; that compositional balance enables targeted intensification while preventing performance collapse; and that evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form cumulative scientific knowledge in pretraining.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-LLM: Towards the Science of Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27164v1">http://arxiv.org/abs/2603.27164v1</a></p>

            <p><strong>Abstract:</strong><br>
            The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome the capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks a systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy spanning filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that processing depth systematically enhances capabilities, making it a critical dimension alongside volume scaling; that different domains exhibit distinct saturation dynamics, necessitating adaptive strategies ranging from proportion adjustments to format shifts; that compositional balance enables targeted intensification while preventing performance collapse; and that evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form cumulative scientific knowledge in pretraining.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Apr 2026 21:25:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/70f2ca89/f501dc6d.mp3" length="24644970" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1537</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwei Qin, Yixiu Liu, Tiantian Mi, Muhang Xie, Zhen Huang, Weiye Si, Pengrui Lu, Siyuan Feng, Xia Wu, Liming Liu, Ye Luo, Jinlong Hou, Qipeng Guo, Yu Qiao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-LLM: Towards the Science of Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27164v1">http://arxiv.org/abs/2603.27164v1</a></p>

            <p><strong>Abstract:</strong><br>
            The foundational pretraining phase determines a model's capability ceiling, as post-training struggles to overcome the capability foundations established during pretraining, yet it remains critically under-explored. This stems from a structural paradox: organizations with computational resources operate under commercial pressures that inhibit transparent disclosure, while academic institutions possess research freedom but lack pretraining-scale computational resources. daVinci-LLM occupies this unexplored intersection, combining industrial-scale resources with full research freedom to advance the science of pretraining. We adopt a fully open paradigm that treats openness as scientific methodology, releasing complete data processing pipelines, full training processes, and systematic exploration results. Recognizing that the field lacks a systematic methodology for data processing, we employ the Data Darwinism framework, a principled L0-L9 taxonomy spanning filtering to synthesis. We train a 3B-parameter model from random initialization across 8T tokens using a two-stage adaptive curriculum that progressively shifts from foundational capabilities to reasoning-intensive enhancement. Through 200+ controlled ablations, we establish that processing depth systematically enhances capabilities, making it a critical dimension alongside volume scaling; that different domains exhibit distinct saturation dynamics, necessitating adaptive strategies ranging from proportion adjustments to format shifts; that compositional balance enables targeted intensification while preventing performance collapse; and that evaluation protocol choices shape our understanding of pretraining progress. By releasing the complete exploration process, we enable the community to build upon our findings and systematic methodologies to form cumulative scientific knowledge in pretraining.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TAPS: Task Aware Proposal Distributions for Speculative Sampling</title>
      <itunes:episode>1697</itunes:episode>
      <podcast:episode>1697</podcast:episode>
      <itunes:title>TAPS: Task Aware Proposal Distributions for Speculative Sampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d080f62f-3269-4a9c-84c7-a2f791938e81</guid>
      <link>https://share.transistor.fm/s/94f880ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            TAPS: Task Aware Proposal Distributions for Speculative Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27027v1">http://arxiv.org/abs/2603.27027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            TAPS: Task Aware Proposal Distributions for Speculative Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27027v1">http://arxiv.org/abs/2603.27027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:20:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94f880ff/537158bf.mp3" length="21459721" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            TAPS: Task Aware Proposal Distributions for Speculative Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27027v1">http://arxiv.org/abs/2603.27027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards a Medical AI Scientist</title>
      <itunes:episode>1696</itunes:episode>
      <podcast:episode>1696</podcast:episode>
      <itunes:title>Towards a Medical AI Scientist</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">68c01685-ae57-4575-82ef-85ec47a509c1</guid>
      <link>https://share.transistor.fm/s/7e53ee2e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan</p>

            <p><strong>Title:</strong><br>
            Towards a Medical AI Scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28589v1">http://arxiv.org/abs/2603.28589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence and specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan</p>

            <p><strong>Title:</strong><br>
            Towards a Medical AI Scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28589v1">http://arxiv.org/abs/2603.28589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence and specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:19:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e53ee2e/92fe683c.mp3" length="23886776" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1489</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan</p>

            <p><strong>Title:</strong><br>
            Towards a Medical AI Scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28589v1">http://arxiv.org/abs/2603.28589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence and specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Gen-Searcher: Reinforcing Agentic Search for Image Generation</title>
      <itunes:episode>1695</itunes:episode>
      <podcast:episode>1695</podcast:episode>
      <itunes:title>Gen-Searcher: Reinforcing Agentic Search for Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">50b15288-20a9-4010-bce3-726a64e60dc6</guid>
      <link>https://share.transistor.fm/s/b4b2650c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            Gen-Searcher: Reinforcing Agentic Search for Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28767v1">http://arxiv.org/abs/2603.28767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, and thus often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            Gen-Searcher: Reinforcing Agentic Search for Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28767v1">http://arxiv.org/abs/2603.28767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, and thus often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.</p>
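
            <p>A small illustrative sketch of the dual reward feedback described above: a text-based score and an image-based score are combined into a single scalar, and rewards are normalized within each group of rollouts in the GRPO style. The equal weighting and function names here are assumptions for illustration, not the released training code.</p>

            <pre><code>def dual_reward(text_score, image_score, w_text=0.5, w_image=0.5):
    # Combine text-grounding and image-fidelity feedback into one scalar reward.
    return w_text * text_score + w_image * image_score

def group_normalized_advantages(rewards):
    # GRPO-style advantage: center and scale each reward within its group of
    # rollouts for the same prompt, so the learning signal is relative.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
</code></pre>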
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:19:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b4b2650c/f8022d8a.mp3" length="25362621" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1581</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaituo Feng, Manyuan Zhang, Shuang Chen, Yunlong Lin, Kaixuan Fan, Yilei Jiang, Hongyu Li, Dian Zheng, Chenyang Wang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            Gen-Searcher: Reinforcing Agentic Search for Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28767v1">http://arxiv.org/abs/2603.28767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent image generation models have shown strong capabilities in generating high-fidelity and photorealistic images. However, they are fundamentally constrained by frozen internal knowledge, and thus often fail in real-world scenarios that are knowledge-intensive or require up-to-date information. In this paper, we present Gen-Searcher, the first attempt to train a search-augmented image generation agent, which performs multi-hop reasoning and search to collect the textual knowledge and reference images needed for grounded generation. To achieve this, we construct a tailored data pipeline and curate two high-quality datasets, Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k, containing diverse search-intensive prompts and corresponding ground-truth synthesis images. We further introduce KnowGen, a comprehensive benchmark that explicitly requires search-grounded external knowledge for image generation and evaluates models from multiple dimensions. Based on these resources, we train Gen-Searcher with SFT followed by agentic reinforcement learning with dual reward feedback, which combines text-based and image-based rewards to provide more stable and informative learning signals for GRPO training. Experiments show that Gen-Searcher brings substantial gains, improving Qwen-Image by around 16 points on KnowGen and 15 points on WISE. We hope this work can serve as an open foundation for search agents in image generation, and we fully open-source our data, models, and code.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emergent Social Intelligence Risks in Generative Multi-Agent Systems</title>
      <itunes:episode>1694</itunes:episode>
      <podcast:episode>1694</podcast:episode>
      <itunes:title>Emergent Social Intelligence Risks in Generative Multi-Agent Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">79aa1a44-c823-4aaa-8ac9-f84530c84ebf</guid>
      <link>https://share.transistor.fm/s/675167db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.MA, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang</p>

            <p><strong>Title:</strong><br>
            Emergent Social Intelligence Risks in Generative Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27771v1">http://arxiv.org/abs/2603.27771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneering study of such emergent multi-agent risks in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.MA, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang</p>

            <p><strong>Title:</strong><br>
            Emergent Social Intelligence Risks in Generative Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27771v1">http://arxiv.org/abs/2603.27771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneering study of such emergent multi-agent risks in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:19:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/675167db/f3ac5296.mp3" length="21477280" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1339</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.MA, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yue Huang, Yu Jiang, Wenjie Wang, Haomin Zhuang, Xiaonan Luo, Yuchen Ma, Zhangchen Xu, Zichen Chen, Nuno Moniz, Zinan Lin, Pin-Yu Chen, Nitesh V Chawla, Nouha Dziri, Huan Sun, Xiangliang Zhang</p>

            <p><strong>Title:</strong><br>
            Emergent Social Intelligence Risks in Generative Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27771v1">http://arxiv.org/abs/2603.27771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems composed of large generative models are rapidly moving from laboratory prototypes to real-world deployments, where they jointly plan, negotiate, and allocate shared resources to solve complex tasks. While such systems promise unprecedented scalability and autonomy, their collective interaction also gives rise to failure modes that cannot be reduced to individual agents. Understanding these emergent risks is therefore critical. Here, we present a pioneering study of such emergent multi-agent risks in workflows that involve competition over shared resources (e.g., computing resources or market share), sequential handoff collaboration (where downstream agents see only predecessor outputs), collective decision aggregation, and others. Across these settings, we observe that such group behaviors arise frequently across repeated trials and a wide range of interaction conditions, rather than as rare or pathological cases. In particular, phenomena such as collusion-like coordination and conformity emerge with non-trivial frequency under realistic resource constraints, communication protocols, and role assignments, mirroring well-known pathologies in human societies despite no explicit instruction. Moreover, these risks cannot be prevented by existing agent-level safeguards alone. These findings expose the dark side of intelligent multi-agent systems: a social intelligence risk where agent collectives, despite no instruction to do so, spontaneously reproduce familiar failure patterns from human societies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EpochX: Building the Infrastructure for an Emergent Agent Civilization</title>
      <itunes:episode>1693</itunes:episode>
      <podcast:episode>1693</podcast:episode>
      <itunes:title>EpochX: Building the Infrastructure for an Emergent Agent Civilization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d8047630-068e-4667-aa28-077c406127b7</guid>
      <link>https://share.transistor.fm/s/69a8c34d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Huacan Wang, Chaofa Yuan, Xialie Zhuang, Tu Hu, Shuo Zhang, Jun Han, Shi Wei, Daiqiang Li, Jingping Liu, Kunyi Wang, Zihan Yin, Zhenheng Tang, Andy Wang, Henry Peng Zou, Philip S. Yu, Sen Hu, Qizhen Lan, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            EpochX: Building the Infrastructure for an Emergent Agent Civilization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27304v1">http://arxiv.org/abs/2603.27304v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human-agent collaboration.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Huacan Wang, Chaofa Yuan, Xialie Zhuang, Tu Hu, Shuo Zhang, Jun Han, Shi Wei, Daiqiang Li, Jingping Liu, Kunyi Wang, Zihan Yin, Zhenheng Tang, Andy Wang, Henry Peng Zou, Philip S. Yu, Sen Hu, Qizhen Lan, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            EpochX: Building the Infrastructure for an Emergent Agent Civilization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27304v1">http://arxiv.org/abs/2603.27304v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human-agent collaboration.</p>
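
            <p>A toy sketch of the credit flow described above: posting a task locks its bounty, acceptance settles the bounty to the worker, and reusing a verified asset pays its creator. The <code>Ledger</code> class and its method names are illustrative assumptions, not EpochX's actual interface.</p>

            <pre><code>from dataclasses import dataclass, field

@dataclass
class Ledger:
    balances: dict = field(default_factory=dict)  # maps participant to credits
    escrow: dict = field(default_factory=dict)    # maps task_id to (poster, bounty)

    def lock_bounty(self, task_id, poster, bounty):
        # Posting a task locks the bounty out of the poster's balance.
        self.balances[poster] = self.balances.get(poster, 0) - bounty
        self.escrow[task_id] = (poster, bounty)

    def settle(self, task_id, worker):
        # On acceptance, the escrowed bounty is released to the worker.
        _poster, bounty = self.escrow.pop(task_id)
        self.balances[worker] = self.balances.get(worker, 0) + bounty

    def pay_reuse_royalty(self, user, creator, amount):
        # Reusing a verified asset compensates its creator.
        self.balances[user] = self.balances.get(user, 0) - amount
        self.balances[creator] = self.balances.get(creator, 0) + amount
</code></pre>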
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:18:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69a8c34d/bfeef993.mp3" length="23120279" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Huacan Wang, Chaofa Yuan, Xialie Zhuang, Tu Hu, Shuo Zhang, Jun Han, Shi Wei, Daiqiang Li, Jingping Liu, Kunyi Wang, Zihan Yin, Zhenheng Tang, Andy Wang, Henry Peng Zou, Philip S. Yu, Sen Hu, Qizhen Lan, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            EpochX: Building the Infrastructure for an Emergent Agent Civilization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27304v1">http://arxiv.org/abs/2603.27304v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose technologies reshape economies less by improving individual tools than by enabling new ways to organize production and coordination. We believe AI agents are approaching a similar inflection point: as foundation models make broad task execution and tool use increasingly accessible, the binding constraint shifts from raw capability to how work is delegated, verified, and rewarded at scale. We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks. EpochX treats humans and agents as peer participants who can post tasks or claim them. Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance. Crucially, EpochX is designed so that each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience. These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time. EpochX also introduces a native credit mechanism to make participation economically viable under real compute costs. Credits lock task bounties, budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused. By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem: building infrastructures where verifiable work leaves persistent, reusable artifacts, and where value flows support durable human-agent collaboration.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models</title>
      <itunes:episode>1692</itunes:episode>
      <podcast:episode>1692</podcast:episode>
      <itunes:title>On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">865159f1-f7c4-48be-ab0f-c22ddfe1425f</guid>
      <link>https://share.transistor.fm/s/9f70a35a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong</p>

            <p><strong>Title:</strong><br>
            On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27481v1">http://arxiv.org/abs/2603.27481v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong</p>

            <p><strong>Title:</strong><br>
            On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27481v1">http://arxiv.org/abs/2603.27481v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.</p>
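
            <p>A compact sketch of the drift-aware token assignment described above, under simplifying assumptions: tokens that look ambiguous or old-task-like receive a penalty on the routing logits of the newly added experts, steering them back toward the frozen experts. The affinity signal and penalty form are illustrative, not the paper's exact regularizers.</p>

            <pre><code>import torch
import torch.nn.functional as F

def drift_aware_routing(scores, new_expert_ids, old_affinity, penalty=5.0):
    # scores:         [num_tokens, num_experts] raw router logits.
    # new_expert_ids: indices of experts added for the current task.
    # old_affinity:   [num_tokens] values in [0, 1] measuring how strongly a
    #                 token matches the frozen old-task experts.
    adjusted = scores.clone()
    # Penalize the new experts' logits for ambiguous / old-looking tokens so
    # they keep routing to the established experts and drift is reduced.
    adjusted[:, new_expert_ids] -= penalty * old_affinity.unsqueeze(1)
    return F.softmax(adjusted, dim=-1)
</code></pre>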
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:18:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9f70a35a/74aadbfa.mp3" length="21385799" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong</p>

            <p><strong>Title:</strong><br>
            On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27481v1">http://arxiv.org/abs/2603.27481v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GEditBench v2: A Human-Aligned Benchmark for General Image Editing</title>
      <itunes:episode>1691</itunes:episode>
      <podcast:episode>1691</podcast:episode>
      <itunes:title>GEditBench v2: A Human-Aligned Benchmark for General Image Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">42eb9150-f016-4421-ab4e-cddfd5fc5a73</guid>
      <link>https://share.transistor.fm/s/5eb472d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen</p>

            <p><strong>Title:</strong><br>
            GEditBench v2: A Human-Aligned Benchmark for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28547v1">http://arxiv.org/abs/2603.28547v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen</p>

            <p><strong>Title:</strong><br>
            GEditBench v2: A Human-Aligned Benchmark for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28547v1">http://arxiv.org/abs/2603.28547v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:17:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5eb472d6/e7052d40.mp3" length="20820246" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen</p>

            <p><strong>Title:</strong><br>
            GEditBench v2: A Human-Aligned Benchmark for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.28547v1">http://arxiv.org/abs/2603.28547v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench using expert-annotated preference pairs to assess the alignment of PVC-Judge with human judgments on visual consistency evaluation. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Make Geometry Matter for Spatial Reasoning</title>
      <itunes:episode>1690</itunes:episode>
      <podcast:episode>1690</podcast:episode>
      <itunes:title>Make Geometry Matter for Spatial Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef595244-b24a-4c8e-85fb-1165cdfb032f</guid>
      <link>https://share.transistor.fm/s/39f9123e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Make Geometry Matter for Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26639v1">http://arxiv.org/abs/2603.26639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Make Geometry Matter for Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26639v1">http://arxiv.org/abs/2603.26639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.</p>
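
            <p>A short sketch of the two mechanisms described above, under stated assumptions: during training a random fraction of 2D vision tokens is masked so the model must consult geometry tokens, and a learned sigmoid gate decides per token how much geometric evidence to inject. Module and parameter names are illustrative, not the released GeoSR code.</p>

            <pre><code>import torch
import torch.nn as nn

class GeometryGuidedFusion(nn.Module):
    def __init__(self, dim, mask_ratio=0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vision_tokens, geometry_tokens):
        # Both inputs: [batch, num_tokens, dim].
        if self.training:
            # Geometry-unleashing masking: drop a random subset of 2D vision
            # tokens to weaken non-geometric shortcuts.
            keep = (torch.rand(vision_tokens.shape[:2], device=vision_tokens.device)
                    .ge(self.mask_ratio).unsqueeze(-1).float())
            vision_tokens = vision_tokens * keep
        # Geometry-guided fusion: a gate adaptively amplifies geometry token
        # contributions where geometric evidence matters.
        gate = self.gate(torch.cat([vision_tokens, geometry_tokens], dim=-1))
        return vision_tokens + gate * geometry_tokens
</code></pre>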
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:17:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39f9123e/8dd91ff6.mp3" length="24470677" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1526</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Make Geometry Matter for Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.26639v1">http://arxiv.org/abs/2603.26639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PRBench: End-to-end Paper Reproduction in Physics Research</title>
      <itunes:episode>1689</itunes:episode>
      <podcast:episode>1689</podcast:episode>
      <itunes:title>PRBench: End-to-end Paper Reproduction in Physics Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9b3a748-1392-4efc-91aa-3a408871895d</guid>
      <link>https://share.transistor.fm/s/f26837f4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, hep-lat, hep-ph, physics.comp-ph, physics.optics</p>

            <p><strong>Authors:</strong><br>
            Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu</p>

            <p><strong>Title:</strong><br>
            PRBench: End-to-end Paper Reproduction in Physics Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27646v1">http://arxiv.org/abs/2603.27646v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, hep-lat, hep-ph, physics.comp-ph, physics.optics</p>

            <p><strong>Authors:</strong><br>
            Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu</p>

            <p><strong>Title:</strong><br>
            PRBench: End-to-end Paper Reproduction in Physics Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27646v1">http://arxiv.org/abs/2603.27646v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Mar 2026 21:17:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f26837f4/cf230ef3.mp3" length="23110654" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, hep-lat, hep-ph, physics.comp-ph, physics.optics</p>

            <p><strong>Authors:</strong><br>
            Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu</p>

            <p><strong>Title:</strong><br>
            PRBench: End-to-end Paper Reproduction in Physics Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.27646v1">http://arxiv.org/abs/2603.27646v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University, each grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PixelSmile: Toward Fine-Grained Facial Expression Editing</title>
      <itunes:episode>1688</itunes:episode>
      <podcast:episode>1688</podcast:episode>
      <itunes:title>PixelSmile: Toward Fine-Grained Facial Expression Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4ef93456-7a31-413e-93ce-f4387bebb745</guid>
      <link>https://share.transistor.fm/s/f3bb5506</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            PixelSmile: Toward Fine-Grained Facial Expression Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25728v1">http://arxiv.org/abs/2603.25728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            PixelSmile: Toward Fine-Grained Facial Expression Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25728v1">http://arxiv.org/abs/2603.25728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:56:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f3bb5506/ab6d202f.mp3" length="26802905" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1672</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            PixelSmile: Toward Fine-Grained Facial Expression Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25728v1">http://arxiv.org/abs/2603.25728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale</title>
      <itunes:episode>1687</itunes:episode>
      <podcast:episode>1687</podcast:episode>
      <itunes:title>Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">49e9f6ea-f469-457d-8d2d-d07178cf1cea</guid>
      <link>https://share.transistor.fm/s/c6c47e92</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25040v1">http://arxiv.org/abs/2603.25040v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25040v1">http://arxiv.org/abs/2603.25040v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:55:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c6c47e92/9b421a62.mp3" length="24079497" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1501</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25040v1">http://arxiv.org/abs/2603.25040v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration</title>
      <itunes:episode>1686</itunes:episode>
      <podcast:episode>1686</podcast:episode>
      <itunes:title>Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ab435daa-4d0f-4b56-b499-ae3bdac4ac7b</guid>
      <link>https://share.transistor.fm/s/7b2aa69a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24800v1">http://arxiv.org/abs/2603.24800v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24800v1">http://arxiv.org/abs/2603.24800v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:55:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7b2aa69a/77a71322.mp3" length="21663698" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24800v1">http://arxiv.org/abs/2603.24800v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we uncover the hidden potential of Diffusion Transformers (DiTs) to significantly enhance generative tasks. Through an in-depth analysis of the denoising process, we demonstrate that introducing a single learned scaling parameter can significantly improve the performance of DiT blocks. Building on this insight, we propose Calibri, a parameter-efficient approach that optimally calibrates DiT components to elevate generative quality. Calibri frames DiT calibration as a black-box reward optimization problem, which is efficiently solved using an evolutionary algorithm and modifies just ~100 parameters. Experimental results reveal that despite its lightweight design, Calibri consistently improves performance across various text-to-image models. Notably, Calibri also reduces the inference steps required for image generation, all while maintaining high-quality outputs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models</title>
      <itunes:episode>1685</itunes:episode>
      <podcast:episode>1685</podcast:episode>
      <itunes:title>RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2e5888f1-9e6c-4ecc-b879-43bc458519fd</guid>
      <link>https://share.transistor.fm/s/934fc819</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, jinghong lan, Shiyu Liu, Yuqi Peng, Gang YU, Shifeng Chen</p>

            <p><strong>Title:</strong><br>
            RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25502v1">http://arxiv.org/abs/2603.25502v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, jinghong lan, Shiyu Liu, Yuqi Peng, Gang YU, Shifeng Chen</p>

            <p><strong>Title:</strong><br>
            RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25502v1">http://arxiv.org/abs/2603.25502v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:54:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/934fc819/f02d4f03.mp3" length="20730839" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufeng Yang, Xianfang Zeng, Zhangqi Jiang, Fukun Yin, Jianzhuang Liu, Wei Cheng, jinghong lan, Shiyu Liu, Yuqi Peng, Gang YU, Shifeng Chen</p>

            <p><strong>Title:</strong><br>
            RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25502v1">http://arxiv.org/abs/2603.25502v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration under real-world degradations is critical for downstream tasks such as autonomous driving and object detection. However, existing restoration models are often limited by the scale and distribution of their training data, resulting in poor generalization to real-world scenarios. Recently, large-scale image editing models have shown strong generalization ability in restoration tasks, especially for closed-source models like Nano Banana Pro, which can restore images while preserving consistency. Nevertheless, achieving such performance with those large universal models requires substantial data and computational costs. To address this issue, we construct a large-scale dataset covering nine common real-world degradation types and train a state-of-the-art open-source model to narrow the gap with closed-source alternatives. Furthermore, we introduce RealIR-Bench, which contains 464 real-world degraded images and tailored evaluation metrics focusing on degradation removal and consistency preservation. Extensive experiments demonstrate our model ranks first among open-source methods, achieving state-of-the-art performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data</title>
      <itunes:episode>1684</itunes:episode>
      <podcast:episode>1684</podcast:episode>
      <itunes:title>MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f6a94124-8e19-4983-b54a-b1af818169ca</guid>
      <link>https://share.transistor.fm/s/cb54e9ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25319v1">http://arxiv.org/abs/2603.25319v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25319v1">http://arxiv.org/abs/2603.25319v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:54:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cb54e9ee/f83e0fc4.mp3" length="21550855" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25319v1">http://arxiv.org/abs/2603.25319v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Voxtral TTS</title>
      <itunes:episode>1683</itunes:episode>
      <podcast:episode>1683</podcast:episode>
      <itunes:title>Voxtral TTS</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bc35f942-cdef-45f7-bd29-1ef2316bd51b</guid>
      <link>https://share.transistor.fm/s/8109cbbc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathieu Schmitt, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Mikhail Biriuchinskii, Minh-Quang Pham, Mircea Lica, Morgane Rivière, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Randall Isenhour, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Belkadi, Sandeep Subramanian, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Tarun Kumar Vangani, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Vedant Nanda, Victor Jouault, Vincent Maladière, Vincent Pfister, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu</p>

            <p><strong>Title:</strong><br>
            Voxtral TTS</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25551v1">http://arxiv.org/abs/2603.25551v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathieu Schmitt, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Mikhail Biriuchinskii, Minh-Quang Pham, Mircea Lica, Morgane Rivière, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Randall Isenhour, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Belkadi, Sandeep Subramanian, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Tarun Kumar Vangani, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Vedant Nanda, Victor Jouault, Vincent Maladière, Vincent Pfister, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu</p>

            <p><strong>Title:</strong><br>
            Voxtral TTS</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25551v1">http://arxiv.org/abs/2603.25551v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Mar 2026 20:54:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8109cbbc/5badc49c.mp3" length="28037928" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1749</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Henry Lagarde, Jean-Malo Delignon, Jaeyoung Kim, John Harvill, Khyathi Raghavi Chandu, Lorenzo Signoretti, Margaret Jennings, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Samuel Humeau, Soham Ghosh, Srijan Mishra, Van Phung, Abdelaziz Bounhar, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andrew Bai, Andrew Zhao, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Arthur Fournier, Artjom Joosen, Avi Sooriyarachchi, Aysenur Karaduman Utkur, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Bowen Yang, Charlotte Cronjäger, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elizaveta Demyanenko, Elliot Chane-Sane, Emmanuel Gottlob, Enguerrand Paquin, Etienne Goffinet, Fabien Niel, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Kunsch, Guillaume Martin, Guillaume Raille, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Harshil Shah, Hope McGovern, Hugo Thimonier, Indraneel Mukherjee, Irene Zhang, Jacques Sun, Jan Ludziejewski, Jason Rute, Jérémie Dentan, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kilian Tep, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Marie Pellat, Mark Prins, Martin Alexandre, Mathieu Poirée, Mathieu Schmitt, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Mikhail Biriuchinskii, Minh-Quang Pham, Mircea Lica, Morgane Rivière, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philippe Pinel, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Randall Isenhour, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Rupert Menneer, Sagar Vaze, Samuel Barry, Samuel Belkadi, Sandeep Subramanian, Sean Cha, Shashwat Verma, Siddhant Waghjale, Siddharth Gandhi, Simon Lepage, Sumukh Aithal, Szymon Antoniak, Tarun Kumar Vangani, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Vedant Nanda, Victor Jouault, Vincent Maladière, Vincent Pfister, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Havard, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yassine El Ouahidi, Yassir Bendou, Yihan Wang, Yimu Pan, Zaccharie Ramzi, Zhenlin Xu</p>

            <p><strong>Title:</strong><br>
            Voxtral TTS</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.25551v1">http://arxiv.org/abs/2603.25551v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme. In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?</title>
      <itunes:episode>1682</itunes:episode>
      <podcast:episode>1682</podcast:episode>
      <itunes:title>Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b8ff4285-6572-4d10-8ab7-08d873b304ca</guid>
      <link>https://share.transistor.fm/s/91f240dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24472v1">http://arxiv.org/abs/2603.24472v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24472v1">http://arxiv.org/abs/2603.24472v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 26 Mar 2026 20:25:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/91f240dc/d45771c8.mp3" length="23902709" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1490</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.24472v1">http://arxiv.org/abs/2603.24472v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</title>
      <itunes:episode>1681</itunes:episode>
      <podcast:episode>1681</podcast:episode>
      <itunes:title>MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">52ff79df-85d4-4f29-b7bf-de7348c683f6</guid>
      <link>https://share.transistor.fm/s/bdc0d262</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22458v1">http://arxiv.org/abs/2603.22458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22458v1">http://arxiv.org/abs/2603.22458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:19:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bdc0d262/26d503ce.mp3" length="23193019" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22458v1">http://arxiv.org/abs/2603.22458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical character recognition (OCR) has evolved from line-level transcription to structured document parsing, requiring models to recover long-form sequences containing layout, tables, and formulas. Despite recent advances in vision-language models, most existing systems rely on autoregressive decoding, which introduces sequential latency and amplifies error propagation in long documents. In this work, we revisit document OCR from an inverse rendering perspective, arguing that left-to-right causal generation is an artifact of serialization rather than an intrinsic property of the task. Motivated by this insight, we propose MinerU-Diffusion, a unified diffusion-based framework that replaces autoregressive sequential decoding with parallel diffusion denoising under visual conditioning. MinerU-Diffusion employs a block-wise diffusion decoder and an uncertainty-driven curriculum learning strategy to enable stable training and efficient long-sequence inference. Extensive experiments demonstrate that MinerU-Diffusion consistently improves robustness while achieving up to 3.2x faster decoding compared to autoregressive baselines. Evaluations on the proposed Semantic Shuffle benchmark further confirm its reduced dependence on linguistic priors and stronger visual OCR capability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG</title>
      <itunes:episode>1680</itunes:episode>
      <podcast:episode>1680</podcast:episode>
      <itunes:title>WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">42b3e24c-45c4-4488-9bac-3c8bdef85ce3</guid>
      <link>https://share.transistor.fm/s/7559f424</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23497v1">http://arxiv.org/abs/2603.23497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23497v1">http://arxiv.org/abs/2603.23497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:19:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7559f424/f88d8e1a.mp3" length="21402093" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23497v1">http://arxiv.org/abs/2603.23497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents</title>
      <itunes:episode>1679</itunes:episode>
      <podcast:episode>1679</podcast:episode>
      <itunes:title>From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">709fbab0-dce6-443b-9d46-797b9f9b8f46</guid>
      <link>https://share.transistor.fm/s/94467c2e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan</p>

            <p><strong>Title:</strong><br>
            From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22386v1">http://arxiv.org/abs/2603.22386v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of the existing body of literature, and a more reproducible evaluation standard for future work in workflow optimization for LLM agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan</p>

            <p><strong>Title:</strong><br>
            From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22386v1">http://arxiv.org/abs/2603.22386v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of the existing body of literature, and a more reproducible evaluation standard for future work in workflow optimization for LLM agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:18:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94467c2e/9715f378.mp3" length="26896149" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1677</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan</p>

            <p><strong>Title:</strong><br>
            From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22386v1">http://arxiv.org/abs/2603.22386v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification. This survey reviews recent methods for designing and optimizing such workflows, which we treat as agentic computation graphs (ACGs). We organize the literature based on when workflow structure is determined, where structure refers to which components or agents are present, how they depend on each other, and how information flows between them. This lens distinguishes static methods, which fix a reusable workflow scaffold before deployment, from dynamic methods, which select, generate, or revise the workflow for a particular run before or during execution. We further organize prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization (e.g., task metrics, verifier signals, preferences, or trace-derived feedback). We also distinguish reusable workflow templates, run-specific realized graphs, and execution traces, separating reusable design choices from the structures actually deployed in a given run and from realized runtime behavior. Finally, we outline a structure-aware evaluation perspective that complements downstream task metrics with graph-level properties, execution cost, robustness, and structural variation across inputs. Our goal is to provide a clear vocabulary, a unified framework for positioning new methods, a more comparable view of the existing body of literature, and a more reproducible evaluation standard for future work in workflow optimization for LLM agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning</title>
      <itunes:episode>1678</itunes:episode>
      <podcast:episode>1678</podcast:episode>
      <itunes:title>SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dba579a6-c7e3-462e-a449-d94fad53a69a</guid>
      <link>https://share.transistor.fm/s/a5ef29cc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo</p>

            <p><strong>Title:</strong><br>
            SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23483v1">http://arxiv.org/abs/2603.23483v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo</p>

            <p><strong>Title:</strong><br>
            SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23483v1">http://arxiv.org/abs/2603.23483v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:18:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5ef29cc/0f6d6716.mp3" length="21074803" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1313</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo</p>

            <p><strong>Title:</strong><br>
            SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23483v1">http://arxiv.org/abs/2603.23483v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PEARL: Personalized Streaming Video Understanding Model</title>
      <itunes:episode>1677</itunes:episode>
      <podcast:episode>1677</podcast:episode>
      <itunes:title>PEARL: Personalized Streaming Video Understanding Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8c56eed0-4626-4155-896f-e510b79b0922</guid>
      <link>https://share.transistor.fm/s/d1269378</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            PEARL: Personalized Streaming Video Understanding Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20422v1">http://arxiv.org/abs/2603.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            PEARL: Personalized Streaming Video Understanding Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20422v1">http://arxiv.org/abs/2603.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:17:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1269378/65410b4f.mp3" length="21493985" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Yuanhong Zheng, Ruichuan An, Xiaopeng Lin, Yuxing Liu, Sihan Yang, Huanyu Zhang, Haodong Li, Qintong Zhang, Renrui Zhang, Guopeng Li, Yifan Zhang, Yuheng Li, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            PEARL: Personalized Streaming Video Understanding Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20422v1">http://arxiv.org/abs/2603.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models</title>
      <itunes:episode>1676</itunes:episode>
      <podcast:episode>1676</podcast:episode>
      <itunes:title>DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1be2de67-c1aa-4711-a2b6-ff90375c8450</guid>
      <link>https://share.transistor.fm/s/295b7a12</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23499v1">http://arxiv.org/abs/2603.23499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23499v1">http://arxiv.org/abs/2603.23499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:17:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/295b7a12/a90f3c76.mp3" length="19935432" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1242</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23499v1">http://arxiv.org/abs/2603.23499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM</title>
      <itunes:episode>1675</itunes:episode>
      <podcast:episode>1675</podcast:episode>
      <itunes:title>SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2235647d-add9-4c88-be91-3fa255cf0257</guid>
      <link>https://share.transistor.fm/s/137deb71</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang</p>

            <p><strong>Title:</strong><br>
            SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23386v1">http://arxiv.org/abs/2603.23386v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang</p>

            <p><strong>Title:</strong><br>
            SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23386v1">http://arxiv.org/abs/2603.23386v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:17:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/137deb71/247cae7c.mp3" length="22831479" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1423</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chuanrui Zhang, Minghan Qin, Yuang Wang, Baifeng Xie, Hang Li, Ziwei Wang</p>

            <p><strong>Title:</strong><br>
            SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23386v1">http://arxiv.org/abs/2603.23386v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation</title>
      <itunes:episode>1674</itunes:episode>
      <podcast:episode>1674</podcast:episode>
      <itunes:title>UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f16945ad-5595-427b-a1e1-279f3d8e2a76</guid>
      <link>https://share.transistor.fm/s/b2409bdf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23500v1">http://arxiv.org/abs/2603.23500v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23500v1">http://arxiv.org/abs/2603.23500v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:16:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2409bdf/fe3a9b29.mp3" length="19492399" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1215</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23500v1">http://arxiv.org/abs/2603.23500v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RealMaster: Lifting Rendered Scenes into Photorealistic Video</title>
      <itunes:episode>1673</itunes:episode>
      <podcast:episode>1673</podcast:episode>
      <itunes:title>RealMaster: Lifting Rendered Scenes into Photorealistic Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">34317c60-c51c-4dd7-827e-2665933546f5</guid>
      <link>https://share.transistor.fm/s/31ccd964</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar</p>

            <p><strong>Title:</strong><br>
            RealMaster: Lifting Rendered Scenes into Photorealistic Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23462v1">http://arxiv.org/abs/2603.23462v1</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar</p>

            <p><strong>Title:</strong><br>
            RealMaster: Lifting Rendered Scenes into Photorealistic Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23462v1">http://arxiv.org/abs/2603.23462v1</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Mar 2026 21:16:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/31ccd964/ea4b41cd.mp3" length="21697955" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar</p>

            <p><strong>Title:</strong><br>
            RealMaster: Lifting Rendered Scenes into Photorealistic Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.23462v1">http://arxiv.org/abs/2603.23462v1</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models</title>
      <itunes:episode>1672</itunes:episode>
      <podcast:episode>1672</podcast:episode>
      <itunes:title>Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">06b0af42-f3a7-4546-bcb0-8be4776b2c56</guid>
      <link>https://share.transistor.fm/s/39ea5431</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 110 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22212v1">http://arxiv.org/abs/2603.22212v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 110 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22212v1">http://arxiv.org/abs/2603.22212v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:46:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39ea5431/d7c4ab52.mp3" length="22201205" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 110 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22212v1">http://arxiv.org/abs/2603.22212v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model</title>
      <itunes:episode>1671</itunes:episode>
      <podcast:episode>1671</podcast:episode>
      <itunes:title>Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">824de257-5fc4-4f75-b9b7-4ccee21bdf45</guid>
      <link>https://share.transistor.fm/s/b3096806</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            SII-GAIR, Sand. ai, :, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21986v1">http://arxiv.org/abs/2603.21986v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            SII-GAIR, Sand. ai, :, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21986v1">http://arxiv.org/abs/2603.21986v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:45:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b3096806/65b7f3c8.mp3" length="22066632" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1375</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            SII-GAIR, Sand. ai, :, Ethan Chern, Hansi Teng, Hanwen Sun, Hao Wang, Hong Pan, Hongyu Jia, Jiadi Su, Jin Li, Junjie Yu, Lijie Liu, Lingzhi Li, Lyumanshan Ye, Min Hu, Qiangang Wang, Quanwei Qi, Steffi Chern, Tao Bu, Taoran Wang, Teren Xu, Tianning Zhang, Tiantian Mi, Weixian Xu, Wenqiang Zhang, Wentai Zhang, Xianping Yi, Xiaojie Cai, Xiaoyang Kang, Yan Ma, Yixiu Liu, Yunbo Zhang, Yunpeng Huang, Yutong Lin, Zewei Tao, Zhaoliang Liu, Zheng Zhang, Zhiyao Cen, Zhixuan Yu, Zhongshu Wang, Zhulin Hu, Zijin Zhou, Zinan Guo, Yue Cao, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21986v1">http://arxiv.org/abs/2603.21986v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning</title>
      <itunes:episode>1670</itunes:episode>
      <podcast:episode>1670</podcast:episode>
      <itunes:title>LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">020bbdf7-2cdf-4f09-bc55-50d94d1b1b1e</guid>
      <link>https://share.transistor.fm/s/1f2e6e4d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21065v1">http://arxiv.org/abs/2603.21065v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement from a given informal problem and producing either a whole proof or a lemma-style sketch directly from the statement. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for policy staleness and the inherent train-inference engine discrepancies at both the sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21065v1">http://arxiv.org/abs/2603.21065v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement from a given informal problem and producing either a whole proof or a lemma-style sketch directly from the statement. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for policy staleness and the inherent train-inference engine discrepancies at both the sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:45:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1f2e6e4d/1b184828.mp3" length="22038637" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21065v1">http://arxiv.org/abs/2603.21065v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement from a given informal problem and producing either a whole proof or a lemma-style sketch directly from the statement. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for policy staleness and the inherent train-inference engine discrepancies at both the sequence and token levels. Additionally, we incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs</title>
      <itunes:episode>1669</itunes:episode>
      <podcast:episode>1669</podcast:episode>
      <itunes:title>Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9dd155cc-4f18-4311-895e-8a8e0c1d74cd</guid>
      <link>https://share.transistor.fm/s/be06cf1a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz</p>

            <p><strong>Title:</strong><br>
            Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16932v1">http://arxiv.org/abs/2603.16932v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz</p>

            <p><strong>Title:</strong><br>
            Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16932v1">http://arxiv.org/abs/2603.16932v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:45:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/be06cf1a/4fe9fb17.mp3" length="25218855" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1572</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz</p>

            <p><strong>Title:</strong><br>
            Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16932v1">http://arxiv.org/abs/2603.16932v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</title>
      <itunes:episode>1668</itunes:episode>
      <podcast:episode>1668</podcast:episode>
      <itunes:title>OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">49234549-8b20-4bca-a4a3-a843dad2bc9d</guid>
      <link>https://share.transistor.fm/s/77057472</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20278v1">http://arxiv.org/abs/2603.20278v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline over a 15M-document corpus using three explicit browser primitives: search, open, and find. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning of a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER-AI-Lab/OpenResearcher.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20278v1">http://arxiv.org/abs/2603.20278v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline over a 15M-document corpus using three explicit browser primitives: search, open, and find. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning of a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER-AI-Lab/OpenResearcher.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:44:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77057472/faaca094.mp3" length="23344324" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1455</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Dongfu Jiang, Xueguang Ma, Haoxiang Zhang, Ping Nie, Yuyu Zhang, Kai Zou, Jianwen Xie, Yu Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20278v1">http://arxiv.org/abs/2603.20278v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training deep research agents requires long-horizon trajectories that interleave search, evidence aggregation, and multi-step reasoning. However, existing data collection pipelines typically rely on proprietary web APIs, making large-scale trajectory synthesis costly, unstable, and difficult to reproduce. We present OpenResearcher, a reproducible pipeline that decouples one-time corpus bootstrapping from multi-turn trajectory synthesis and executes the search-and-browse loop entirely offline over a 15M-document corpus using three explicit browser primitives: search, open, and find. Using GPT-OSS-120B as the teacher model, we synthesize over 97K trajectories, including a substantial long-horizon tail with 100+ tool calls. Supervised fine-tuning of a 30B-A3B backbone on these trajectories achieves 54.8% accuracy on BrowseComp-Plus, a +34.0 point improvement over the base model, while remaining competitive on BrowseComp, GAIA, and xbench-DeepSearch. Because the environment is offline and fully instrumented, it also enables controlled analysis, where our study reveals practical insights into deep research pipeline design, including data filtering strategies, agent configuration choices, and how retrieval success relates to final answer accuracy. We release the pipeline, synthesized trajectories, model checkpoints, and the offline search environment at https://github.com/TIGER-AI-Lab/OpenResearcher.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding</title>
      <itunes:episode>1667</itunes:episode>
      <podcast:episode>1667</podcast:episode>
      <itunes:title>VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0badc3a7-913e-4f9b-81d3-a399bd1b0a36</guid>
      <link>https://share.transistor.fm/s/b66bd3e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu</p>

            <p><strong>Title:</strong><br>
            VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22285v1">http://arxiv.org/abs/2603.22285v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu</p>

            <p><strong>Title:</strong><br>
            VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22285v1">http://arxiv.org/abs/2603.22285v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:44:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b66bd3e2/d549e838.mp3" length="21801236" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu</p>

            <p><strong>Title:</strong><br>
            VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22285v1">http://arxiv.org/abs/2603.22285v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning</title>
      <itunes:episode>1666</itunes:episode>
      <podcast:episode>1666</podcast:episode>
      <itunes:title>SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa9ef599-6b4b-4931-8eff-d6ac9c6d3055</guid>
      <link>https://share.transistor.fm/s/23bacacd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin</p>

            <p><strong>Title:</strong><br>
            SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22057v1">http://arxiv.org/abs/2603.22057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject this spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a 3.8% gain over the pre-trained DINOv3.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin</p>

            <p><strong>Title:</strong><br>
            SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22057v1">http://arxiv.org/abs/2603.22057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.</p>
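
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of the first step described above: turning 3D positions recovered from an image into plain-language spatial statements an LLM could consume. The object names, coordinate convention, and phrasing are invented for illustration and are not part of the SpatialBoost pipeline.</p>

            <pre><code># Toy example only; assumes a camera frame where +x is right and +z is depth.
def spatial_sentences(objects):
    """objects: list of (name, (x, y, z)) positions in metres."""
    lines = []
    for i, (name_a, pos_a) in enumerate(objects):
        for name_b, pos_b in objects[i + 1:]:
            dx = pos_b[0] - pos_a[0]
            dz = pos_b[2] - pos_a[2]
            side = "to the right of" if dx > 0 else "to the left of"
            depth = "behind" if dz > 0 else "in front of"
            lines.append(f"The {name_b} is {abs(dx):.1f} m {side} and "
                         f"{abs(dz):.1f} m {depth} the {name_a}.")
    return lines

scene = [("table", (0.0, 0.0, 2.0)), ("chair", (0.8, 0.0, 2.6)), ("lamp", (-1.1, 0.0, 1.4))]
print("\n".join(spatial_sentences(scene)))   # sentences that could seed a CoT prompt</code></pre>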
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:44:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/23bacacd/36de13b0.mp3" length="21999740" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1371</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin</p>

            <p><strong>Title:</strong><br>
            SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.22057v1">http://arxiv.org/abs/2603.22057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting</title>
      <itunes:episode>1665</itunes:episode>
      <podcast:episode>1665</podcast:episode>
      <itunes:title>F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ab2a56d-3ac0-46ce-9874-2d763f9b4a94</guid>
      <link>https://share.transistor.fm/s/f162e2a3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21304v1">http://arxiv.org/abs/2603.21304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21304v1">http://arxiv.org/abs/2603.21304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.</p>
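
            <p><strong>Illustrative sketch:</strong><br>
            A hypothetical sketch of the budget-control idea described above: a fixed Gaussian budget is split across regions in proportion to predicted densification scores. The function name, example scores, and rounding scheme are assumptions for illustration, not the paper's implementation.</p>

            <pre><code># Hypothetical budget allocation; region_scores stand in for the model's
# predicted per-region densification scores.
import numpy as np

def allocate_gaussians(region_scores, total_budget):
    """Split total_budget Gaussians across regions proportionally to their scores."""
    scores = np.clip(np.asarray(region_scores, dtype=float), 1e-8, None)
    raw = scores / scores.sum() * total_budget
    counts = np.floor(raw).astype(int)
    # hand the leftover budget to the regions with the largest fractional parts
    leftover = total_budget - counts.sum()
    order = np.argsort(raw - counts)[::-1]
    counts[order[:leftover]] += 1
    return counts

# toy usage: a textured region (0.9) gets many Gaussians, a flat wall (0.05) gets few
print(allocate_gaussians([0.9, 0.05, 0.3, 0.1], total_budget=1000))</code></pre>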
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:43:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f162e2a3/55263aa9.mp3" length="21968399" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Injae Kim, Chaehyeon Kim, Minseong Bae, Minseok Joo, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21304v1">http://arxiv.org/abs/2603.21304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT</title>
      <itunes:episode>1664</itunes:episode>
      <podcast:episode>1664</podcast:episode>
      <itunes:title>mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">18dcb410-bb2d-4e54-a9b1-e9f4c6e53bc1</guid>
      <link>https://share.transistor.fm/s/2134fea1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21606v2">http://arxiv.org/abs/2603.21606v2</a></p>

            <p><strong>Abstract:</strong><br>
            Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21606v2">http://arxiv.org/abs/2603.21606v2</a></p>

            <p><strong>Abstract:</strong><br>
            Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.</p>
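
            <p><strong>Illustrative sketch:</strong><br>
            A structural pseudocode sketch of the loop described above. The helpers passed in (train, earliest_overfit, best_checkpoint) are hypothetical stand-ins a caller would have to supply; this is not the authors' implementation.</p>

            <pre><code># Structural sketch only; train, earliest_overfit, and best_checkpoint are
# hypothetical callables, not part of the paper's released code.
def msft_style_search(model, mixture, train, earliest_overfit, best_checkpoint, budget):
    """Iteratively drop the sub-dataset that overfits first, reverting to its best checkpoint."""
    active = list(mixture)                        # sub-datasets still in the training mixture
    while active:
        history = train(model, active, budget)    # per-dataset validation curves + checkpoints
        victim = earliest_overfit(history)        # sub-dataset whose validation loss turned up first
        if victim is None:
            break                                 # nothing overfits within this budget; stop searching
        model = best_checkpoint(history, victim)  # roll back to that dataset's optimum
        active.remove(victim)                     # exclude it and continue on the remaining mixture
    return model</code></pre>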
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Mar 2026 21:43:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2134fea1/5b4fb5df.mp3" length="23526545" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1467</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.21606v2">http://arxiv.org/abs/2603.21606v2</a></p>

            <p><strong>Abstract:</strong><br>
            Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning</title>
      <itunes:episode>1663</itunes:episode>
      <podcast:episode>1663</podcast:episode>
      <itunes:title>HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bdcbfd82-ac20-4321-be87-0ee8075a1795</guid>
      <link>https://share.transistor.fm/s/4fe61b9a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17024v2">http://arxiv.org/abs/2603.17024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain's multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains. Consistently, replacing full chained queries with half-multi-hop or single-hop variants reduces the average score across five representative benchmarks from 70.4 to 66.7 and 64.3, respectively. Notably, multi-hop gains peak in long-CoT vision-language reasoning, exceeding 50 points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17024v2">http://arxiv.org/abs/2603.17024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain's multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains. Consistently, replacing full chained queries with half-multi-hop or single-hop variants reduces the average score across five representative benchmarks from 70.4 to 66.7 and 64.3, respectively. Notably, multi-hop gains peak in long-CoT vision-language reasoning, exceeding 50 points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:08:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4fe61b9a/a0c785c0.mp3" length="23669068" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1476</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17024v2">http://arxiv.org/abs/2603.17024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain's multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains. Consistently, replacing full chained queries with half-multi-hop or single-hop variants reduces the average score across five representative benchmarks from 70.4 to 66.7 and 64.3, respectively. Notably, multi-hop gains peak in long-CoT vision-language reasoning, exceeding 50 points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models</title>
      <itunes:episode>1662</itunes:episode>
      <podcast:episode>1662</podcast:episode>
      <itunes:title>Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2a38a548-66aa-483c-b0af-d40a8fc8cc2e</guid>
      <link>https://share.transistor.fm/s/9bb32a94</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17051v1">http://arxiv.org/abs/2603.17051v1</a></p>

            <p><strong>Abstract:</strong><br>
            Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17051v1">http://arxiv.org/abs/2603.17051v1</a></p>

            <p><strong>Abstract:</strong><br>
            Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:08:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bb32a94/f19cc33d.mp3" length="23402014" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1459</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17051v1">http://arxiv.org/abs/2603.17051v1</a></p>

            <p><strong>Abstract:</strong><br>
            Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation</title>
      <itunes:episode>1661</itunes:episode>
      <podcast:episode>1661</podcast:episode>
      <itunes:title>TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5d8ccb17-df85-49ac-a334-373d158ac941</guid>
      <link>https://share.transistor.fm/s/2b212775</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota</p>

            <p><strong>Title:</strong><br>
            TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19039v1">http://arxiv.org/abs/2603.19039v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota</p>

            <p><strong>Title:</strong><br>
            TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19039v1">http://arxiv.org/abs/2603.19039v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:07:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2b212775/b025aa77.mp3" length="24968490" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1557</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota</p>

            <p><strong>Title:</strong><br>
            TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19039v1">http://arxiv.org/abs/2603.19039v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models</title>
      <itunes:episode>1660</itunes:episode>
      <podcast:episode>1660</podcast:episode>
      <itunes:title>ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">73bc166b-86be-4b3e-be8c-0f990480284e</guid>
      <link>https://share.transistor.fm/s/1d8745b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini</p>

            <p><strong>Title:</strong><br>
            ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19466v1">http://arxiv.org/abs/2603.19466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini</p>

            <p><strong>Title:</strong><br>
            ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19466v1">http://arxiv.org/abs/2603.19466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:07:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1d8745b3/4616e02b.mp3" length="19960934" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini</p>

            <p><strong>Title:</strong><br>
            ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19466v1">http://arxiv.org/abs/2603.19466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow</title>
      <itunes:episode>1659</itunes:episode>
      <podcast:episode>1659</podcast:episode>
      <itunes:title>FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">668da7c4-a6e4-43e9-93df-09fcd169c652</guid>
      <link>https://share.transistor.fm/s/0c9ea01b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19598v1">http://arxiv.org/abs/2603.19598v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19598v1">http://arxiv.org/abs/2603.19598v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:06:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c9ea01b/3ae87b2b.mp3" length="25058792" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1562</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19598v1">http://arxiv.org/abs/2603.19598v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tightly coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects' shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus</title>
      <itunes:episode>1658</itunes:episode>
      <podcast:episode>1658</podcast:episode>
      <itunes:title>The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fbc97896-0b2d-498d-bf67-dccaed6c8827</guid>
      <link>https://share.transistor.fm/s/78d7908d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar</p>

            <p><strong>Title:</strong><br>
            The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20105v1">http://arxiv.org/abs/2603.20105v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar</p>

            <p><strong>Title:</strong><br>
            The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20105v1">http://arxiv.org/abs/2603.20105v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.</p>
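
            <p><strong>Illustrative sketch:</strong><br>
            A speculative illustration of the control pattern described above, not the released $λ$-RLM runtime: recursion is handled by a fixed divide-and-merge combinator, and the language model is only ever called on bounded-size leaves. The llm callable and the character-based bound are stand-in assumptions.</p>

            <pre><code># Speculative sketch; 'llm' is a stand-in callable and max_chars plays the role
# of the bounded-leaf size. This is not the open-sourced lambda-RLM code.
def recurse(text, llm, max_chars=2000):
    """Divide-and-merge combinator: split until a leaf fits the bound, then merge answers."""
    if len(text) > max_chars:
        mid = len(text) // 2
        left = recurse(text[:mid], llm, max_chars)
        right = recurse(text[mid:], llm, max_chars)
        return llm("Merge these two partial answers:\n" + left + "\n---\n" + right)
    return llm("Answer from this excerpt only:\n" + text)   # neural inference on a bounded input

# toy usage with a fake "model" so the control flow runs end to end
fake_llm = lambda prompt: prompt.splitlines()[-1][:60]
print(recurse("Fact: the meeting moved to Tuesday. " * 400, fake_llm))</code></pre>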
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:06:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78d7908d/1ca61bb7.mp3" length="25139532" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1568</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar</p>

            <p><strong>Title:</strong><br>
            The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20105v1">http://arxiv.org/abs/2603.20105v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation</title>
      <itunes:episode>1657</itunes:episode>
      <podcast:episode>1657</podcast:episode>
      <itunes:title>LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26ea5779-b9e5-4148-8592-168793f0254a</guid>
      <link>https://share.transistor.fm/s/f071f213</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu</p>

            <p><strong>Title:</strong><br>
            LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20192v1">http://arxiv.org/abs/2603.20192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu</p>

            <p><strong>Title:</strong><br>
            LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20192v1">http://arxiv.org/abs/2603.20192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:06:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f071f213/0a4b18cd.mp3" length="21452219" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu</p>

            <p><strong>Title:</strong><br>
            LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.20192v1">http://arxiv.org/abs/2603.20192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hyperagents</title>
      <itunes:episode>1656</itunes:episode>
      <podcast:episode>1656</podcast:episode>
      <itunes:title>Hyperagents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f16d9e5a-6de8-44d7-84ce-49ceae8bd464</guid>
      <link>https://share.transistor.fm/s/36bee6e1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina</p>

            <p><strong>Title:</strong><br>
            Hyperagents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19461v1">http://arxiv.org/abs/2603.19461v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce \textbf{hyperagents}, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina</p>

            <p><strong>Title:</strong><br>
            Hyperagents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19461v1">http://arxiv.org/abs/2603.19461v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce \textbf{hyperagents}, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Mar 2026 21:05:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/36bee6e1/af344110.mp3" length="23101830" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1440</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, Tatiana Shavrina</p>

            <p><strong>Title:</strong><br>
            Hyperagents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19461v1">http://arxiv.org/abs/2603.19461v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes. Existing approaches to self-improvement rely on fixed, handcrafted meta-level mechanisms, fundamentally limiting how fast such systems can improve. The Darwin Gödel Machine (DGM) demonstrates open-ended self-improvement in coding by repeatedly generating and evaluating self-modified variants. Because both evaluation and self-modification are coding tasks, gains in coding ability can translate into gains in self-improvement ability. However, this alignment does not generally hold beyond coding domains. We introduce \textbf{hyperagents}, self-referential agents that integrate a task agent (which solves the target task) and a meta agent (which modifies itself and the task agent) into a single editable program. Crucially, the meta-level modification procedure is itself editable, enabling metacognitive self-modification, improving not only the task-solving behavior, but also the mechanism that generates future improvements. We instantiate this framework by extending DGM to create DGM-Hyperagents (DGM-H), eliminating the assumption of domain-specific alignment between task performance and self-modification skill to potentially support self-accelerating progress on any computable task. Across diverse domains, the DGM-H improves performance over time and outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. Furthermore, the DGM-H improves the process by which it generates new agents (e.g., persistent memory, performance tracking), and these meta-level improvements transfer across domains and accumulate across runs. DGM-Hyperagents offer a glimpse of open-ended AI systems that do not merely search for better solutions, but continually improve their search for how to improve.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding</title>
      <itunes:episode>1655</itunes:episode>
      <podcast:episode>1655</podcast:episode>
      <itunes:title>Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c813afd7-4056-4ea7-ba95-b529f395acb2</guid>
      <link>https://share.transistor.fm/s/30c181db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai</p>

            <p><strong>Title:</strong><br>
            Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19235v1">http://arxiv.org/abs/2603.19235v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai</p>

            <p><strong>Title:</strong><br>
            Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19235v1">http://arxiv.org/abs/2603.19235v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:13:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/30c181db/fc612cee.mp3" length="23033775" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai</p>

            <p><strong>Title:</strong><br>
            Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19235v1">http://arxiv.org/abs/2603.19235v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing</title>
      <itunes:episode>1654</itunes:episode>
      <podcast:episode>1654</podcast:episode>
      <itunes:title>SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">861c46e2-6af8-444d-a7b8-b2f42598bd8e</guid>
      <link>https://share.transistor.fm/s/ab00d4d4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang</p>

            <p><strong>Title:</strong><br>
            SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19228v1">http://arxiv.org/abs/2603.19228v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang</p>

            <p><strong>Title:</strong><br>
            SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19228v1">http://arxiv.org/abs/2603.19228v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:12:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ab00d4d4/8bd973c7.mp3" length="26102023" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1628</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang</p>

            <p><strong>Title:</strong><br>
            SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19228v1">http://arxiv.org/abs/2603.19228v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FASTER: Rethinking Real-Time Flow VLAs</title>
      <itunes:episode>1653</itunes:episode>
      <podcast:episode>1653</podcast:episode>
      <itunes:title>FASTER: Rethinking Real-Time Flow VLAs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0cb2a352-bc52-4eda-8d7c-a52f077e6585</guid>
      <link>https://share.transistor.fm/s/e96d1104</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FASTER: Rethinking Real-Time Flow VLAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19199v1">http://arxiv.org/abs/2603.19199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FASTER: Rethinking Real-Time Flow VLAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19199v1">http://arxiv.org/abs/2603.19199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:12:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e96d1104/4c4306ac.mp3" length="21734713" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1355</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FASTER: Rethinking Real-Time Flow VLAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19199v1">http://arxiv.org/abs/2603.19199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of the long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model</title>
      <itunes:episode>1652</itunes:episode>
      <podcast:episode>1652</podcast:episode>
      <itunes:title>3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d01cbbe9-6089-4ee6-98fd-ea6441848353</guid>
      <link>https://share.transistor.fm/s/ed22efda</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18524v1">http://arxiv.org/abs/2603.18524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18524v1">http://arxiv.org/abs/2603.18524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:12:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ed22efda/ca9f38c9.mp3" length="20528511" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18524v1">http://arxiv.org/abs/2603.18524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer</title>
      <itunes:episode>1651</itunes:episode>
      <podcast:episode>1651</podcast:episode>
      <itunes:title>Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e194cf09-651b-49d7-b983-5cc7e1c9485f</guid>
      <link>https://share.transistor.fm/s/7ebf35db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19227v1">http://arxiv.org/abs/2603.19227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19227v1">http://arxiv.org/abs/2603.19227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:11:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ebf35db/3bdd3df2.mp3" length="22912155" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19227v1">http://arxiv.org/abs/2603.19227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction</title>
      <itunes:episode>1650</itunes:episode>
      <podcast:episode>1650</podcast:episode>
      <itunes:title>MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">87e0f6e8-96db-42cc-8271-9e646e05b189</guid>
      <link>https://share.transistor.fm/s/547f7463</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19231v1">http://arxiv.org/abs/2603.19231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19231v1">http://arxiv.org/abs/2603.19231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:11:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/547f7463/9a0c9b0d.mp3" length="22007687" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19231v1">http://arxiv.org/abs/2603.19231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation</title>
      <itunes:episode>1649</itunes:episode>
      <podcast:episode>1649</podcast:episode>
      <itunes:title>Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dd09da3a-13ba-458a-acae-ae6d24f5c063</guid>
      <link>https://share.transistor.fm/s/9e8d1436</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19220v1">http://arxiv.org/abs/2603.19220v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19220v1">http://arxiv.org/abs/2603.19220v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:10:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9e8d1436/00bbc118.mp3" length="22090034" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19220v1">http://arxiv.org/abs/2603.19220v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens</title>
      <itunes:episode>1648</itunes:episode>
      <podcast:episode>1648</podcast:episode>
      <itunes:title>Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d9f69be7-d177-4c5c-bbd8-0704855ab571</guid>
      <link>https://share.transistor.fm/s/05776399</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19232v1">http://arxiv.org/abs/2603.19232v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19232v1">http://arxiv.org/abs/2603.19232v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:10:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/05776399/ceb3054f.mp3" length="21150880" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19232v1">http://arxiv.org/abs/2603.19232v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation: any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at T regardless of feature dimensionality, where T ≪ hwd. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs</title>
      <itunes:episode>1647</itunes:episode>
      <podcast:episode>1647</podcast:episode>
      <itunes:title>LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b7926e1b-5579-4825-8663-ac9122e7a880</guid>
      <link>https://share.transistor.fm/s/a5ad360e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang</p>

            <p><strong>Title:</strong><br>
            LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19217v1">http://arxiv.org/abs/2603.19217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench contains 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang</p>

            <p><strong>Title:</strong><br>
            LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19217v1">http://arxiv.org/abs/2603.19217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench contains 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:10:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5ad360e/b2143782.mp3" length="20854536" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang</p>

            <p><strong>Title:</strong><br>
            LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.19217v1">http://arxiv.org/abs/2603.19217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench contains 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Memento-Skills: Let Agents Design Agents</title>
      <itunes:episode>1646</itunes:episode>
      <podcast:episode>1646</podcast:episode>
      <itunes:title>Memento-Skills: Let Agents Design Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b791892a-48c7-42ff-88e9-957e10fb049b</guid>
      <link>https://share.transistor.fm/s/611455cc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento-Skills: Let Agents Design Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18743v1">http://arxiv.org/abs/2603.18743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce <em>Memento-Skills</em>, a generalist, continually-learnable LLM agent system that functions as an <em>agent-designing agent</em>: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with <em>stateful prompts</em>, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the <em>Read-Write Reflective Learning</em> mechanism introduced in <em>Memento 2</em> [wang2025memento2]. In the <em>read</em> phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the <em>write</em> phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables <em>continual learning without updating LLM parameters</em>, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to <em>design agents end-to-end</em> for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the <em>General AI Assistants</em> benchmark and <em>Humanity's Last Exam</em> demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento-Skills: Let Agents Design Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18743v1">http://arxiv.org/abs/2603.18743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce <em>Memento-Skills</em>, a generalist, continually-learnable LLM agent system that functions as an <em>agent-designing agent</em>: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with <em>stateful prompts</em>, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the <em>Read-Write Reflective Learning</em> mechanism introduced in <em>Memento 2</em> [wang2025memento2]. In the <em>read</em> phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the <em>write</em> phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables <em>continual learning without updating LLM parameters</em>, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to <em>design agents end-to-end</em> for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the <em>General AI Assistants</em> benchmark and <em>Humanity's Last Exam</em> demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Mar 2026 21:09:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/611455cc/13fe601e.mp3" length="23700377" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1478</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento-Skills: Let Agents Design Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.18743v1">http://arxiv.org/abs/2603.18743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce <em>Memento-Skills</em>, a generalist, continually-learnable LLM agent system that functions as an <em>agent-designing agent</em>: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with <em>stateful prompts</em>, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the <em>Read-Write Reflective Learning</em> mechanism introduced in <em>Memento 2</em> [wang2025memento2]. In the <em>read</em> phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the <em>write</em> phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables <em>continual learning without updating LLM parameters</em>, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to <em>design agents end-to-end</em> for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the <em>General AI Assistants</em> benchmark and <em>Humanity's Last Exam</em> demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild</title>
      <itunes:episode>1645</itunes:episode>
      <podcast:episode>1645</podcast:episode>
      <itunes:title>MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b55ec3b8-3195-4b8d-ad99-27e50c6bdb4c</guid>
      <link>https://share.transistor.fm/s/1b25cd8e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17187v1">http://arxiv.org/abs/2603.17187v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17187v1">http://arxiv.org/abs/2603.17187v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:50:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b25cd8e/f0d9ad20.mp3" length="22748298" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1418</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17187v1">http://arxiv.org/abs/2603.17187v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents are increasingly used for complex tasks, yet deployed agents often remain static, failing to adapt as user needs evolve. This creates a tension between the need for continuous service and the necessity of updating capabilities to match shifting task distributions. On platforms like OpenClaw, which handle diverse workloads across 20+ channels, existing methods either store raw trajectories without distilling knowledge, maintain static skill libraries, or require disruptive downtime for retraining. We present MetaClaw, a continual meta-learning framework that jointly evolves a base LLM policy and a library of reusable behavioral skills. MetaClaw employs two complementary mechanisms. Skill-driven fast adaptation analyzes failure trajectories via an LLM evolver to synthesize new skills, enabling immediate improvement with zero downtime. Opportunistic policy optimization performs gradient-based updates via cloud LoRA fine-tuning and Reinforcement Learning with a Process Reward Model (RL-PRM). This is triggered during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors system inactivity and calendar data. These mechanisms are mutually reinforcing: a refined policy generates better trajectories for skill synthesis, while richer skills provide higher-quality data for policy optimization. To prevent data contamination, a versioning mechanism separates support and query data. Built on a proxy-based architecture, MetaClaw scales to production-size LLMs without local GPUs. Experiments on MetaClaw-Bench and AutoResearchClaw show that skill-driven adaptation improves accuracy by up to 32% relative. The full pipeline advances Kimi-K2.5 accuracy from 21.4% to 40.6% and increases composite robustness by 18.3%. Code is available at https://github.com/aiming-lab/MetaClaw.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video-CoE: Reinforcing Video Event Prediction via Chain of Events</title>
      <itunes:episode>1644</itunes:episode>
      <podcast:episode>1644</podcast:episode>
      <itunes:title>Video-CoE: Reinforcing Video Event Prediction via Chain of Events</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a628693d-1748-492a-a261-c28694c034ba</guid>
      <link>https://share.transistor.fm/s/d2f99738</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Video-CoE: Reinforcing Video Event Prediction via Chain of Events</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14935v1">http://arxiv.org/abs/2603.14935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the <strong>C</strong>hain <strong>o</strong>f <strong>E</strong>vents (<strong>CoE</strong>) paradigm, which constructs temporal event chains to implicitly compel the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Video-CoE: Reinforcing Video Event Prediction via Chain of Events</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14935v1">http://arxiv.org/abs/2603.14935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the <strong>C</strong>hain <strong>o</strong>f <strong>E</strong>vents (<strong>CoE</strong>) paradigm, which constructs temporal event chains to implicitly compel the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:50:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d2f99738/23089ea9.mp3" length="22646726" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Video-CoE: Reinforcing Video Event Prediction via Chain of Events</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14935v1">http://arxiv.org/abs/2603.14935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in the application of MLLMs for various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the <strong>C</strong>hain <strong>o</strong>f <strong>E</strong>vents (<strong>CoE</strong>) paradigm, which constructs temporal event chains to implicitly compel the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability with multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MosaicMem: Hybrid Spatial Memory for Controllable Video World Models</title>
      <itunes:episode>1643</itunes:episode>
      <podcast:episode>1643</podcast:episode>
      <itunes:title>MosaicMem: Hybrid Spatial Memory for Controllable Video World Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b06cf764-9bbb-47e1-9d47-36625c891063</guid>
      <link>https://share.transistor.fm/s/a19aa242</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg</p>

            <p><strong>Title:</strong><br>
            MosaicMem: Hybrid Spatial Memory for Controllable Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17117v1">http://arxiv.org/abs/2603.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg</p>

            <p><strong>Title:</strong><br>
            MosaicMem: Hybrid Spatial Memory for Controllable Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17117v1">http://arxiv.org/abs/2603.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:49:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a19aa242/2f92c6b0.mp3" length="22429809" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1398</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Yu, Runjia Qian, Yumeng Li, Liquan Wang, Songheng Yin, Sri Siddarth Chakaravarthy P, Dennis Anthony, Yang Ye, Yidi Li, Weiwei Wan, Animesh Garg</p>

            <p><strong>Title:</strong><br>
            MosaicMem: Hybrid Spatial Memory for Controllable Video World Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17117v1">http://arxiv.org/abs/2603.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Alignment Makes Language Models Normative, Not Descriptive</title>
      <itunes:episode>1642</itunes:episode>
      <podcast:episode>1642</podcast:episode>
      <itunes:title>Alignment Makes Language Models Normative, Not Descriptive</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8011eeff-de5f-478b-a20d-5188fbf40474</guid>
      <link>https://share.transistor.fm/s/0e9cae22</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI, cs.GT</p>

            <p><strong>Authors:</strong><br>
            Eilam Shapira, Moshe Tennenholtz, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            Alignment Makes Language Models Normative, Not Descriptive</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17218v1">http://arxiv.org/abs/2603.17218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI, cs.GT</p>

            <p><strong>Authors:</strong><br>
            Eilam Shapira, Moshe Tennenholtz, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            Alignment Makes Language Models Normative, Not Descriptive</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17218v1">http://arxiv.org/abs/2603.17218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:49:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0e9cae22/613efdaf.mp3" length="20676460" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI, cs.GT</p>

            <p><strong>Authors:</strong><br>
            Eilam Shapira, Moshe Tennenholtz, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            Alignment Makes Language Models Normative, Not Descriptive</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17218v1">http://arxiv.org/abs/2603.17218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Complementary Reinforcement Learning</title>
      <itunes:episode>1641</itunes:episode>
      <podcast:episode>1641</podcast:episode>
      <itunes:title>Complementary Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4cedd54a-e133-41c6-aa25-00b648b4b223</guid>
      <link>https://share.transistor.fm/s/b6a70a5c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Complementary Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17621v1">http://arxiv.org/abs/2603.17621v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to coevolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Complementary Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17621v1">http://arxiv.org/abs/2603.17621v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to coevolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:49:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b6a70a5c/664f1eb3.mp3" length="23101019" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1440</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dilxat Muhtar, Jiashun Liu, Wei Gao, Weixun Wang, Shaopan Xiong, Ju Huang, Siran Yang, Wenbo Su, Jiamang Wang, Ling Pan, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Complementary Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.17621v1">http://arxiv.org/abs/2603.17621v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) has emerged as a powerful paradigm for training LLM-based agents, yet remains limited by low sample efficiency, stemming not only from sparse outcome feedback but also from the agent's inability to leverage prior experience across episodes. While augmenting agents with historical experience offers a promising remedy, existing approaches suffer from a critical weakness: the experience distilled from history is either stored statically or fails to coevolve with the improving actor, causing a progressive misalignment between the experience and the actor's evolving capability that diminishes its utility over the course of training. Inspired by complementary learning systems in neuroscience, we present Complementary RL to achieve seamless co-evolution of an experience extractor and a policy actor within the RL optimization loop. Specifically, the actor is optimized via sparse outcome-based rewards, while the experience extractor is optimized according to whether its distilled experiences demonstrably contribute to the actor's success, thereby evolving its experience management strategy in lockstep with the actor's growing capabilities. Empirically, Complementary RL outperforms outcome-based agentic RL baselines that do not learn from experience, achieving a 10% performance improvement in single-task scenarios and exhibiting robust scalability in multi-task settings. These results establish Complementary RL as a paradigm for efficient experience-driven agent learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When AI Navigates the Fog of War</title>
      <itunes:episode>1640</itunes:episode>
      <podcast:episode>1640</podcast:episode>
      <itunes:title>When AI Navigates the Fog of War</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ad3f5d15-68ff-4ba0-bfde-743dfc548de8</guid>
      <link>https://share.transistor.fm/s/190ad14e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Xirui Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            When AI Navigates the Fog of War</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16642v1">http://arxiv.org/abs/2603.16642v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Xirui Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            When AI Navigates the Fog of War</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16642v1">http://arxiv.org/abs/2603.16642v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Mar 2026 20:48:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/190ad14e/acb0afbf.mp3" length="23027454" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Xirui Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            When AI Navigates the Fog of War</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16642v1">http://arxiv.org/abs/2603.16642v1</a></p>

            <p><strong>Abstract:</strong><br>
            Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiroThinker-1.7 &amp; H1: Towards Heavy-Duty Research Agents via Verification</title>
      <itunes:episode>1639</itunes:episode>
      <podcast:episode>1639</podcast:episode>
      <itunes:title>MiroThinker-1.7 &amp; H1: Towards Heavy-Duty Research Agents via Verification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a6ed5e8-053c-45a4-9a51-3240f1ff316e</guid>
      <link>https://share.transistor.fm/s/adf2b759</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker-1.7 &amp; H1: Towards Heavy-Duty Research Agents via Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15726v1">http://arxiv.org/abs/2603.15726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker-1.7 &amp; H1: Towards Heavy-Duty Research Agents via Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15726v1">http://arxiv.org/abs/2603.15726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:23:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/adf2b759/7f9359ad.mp3" length="24146790" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1505</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B. L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker-1.7 &amp; H1: Towards Heavy-Duty Research Agents via Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15726v1">http://arxiv.org/abs/2603.15726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InCoder-32B: Code Foundation Model for Industrial Scenarios</title>
      <itunes:episode>1638</itunes:episode>
      <podcast:episode>1638</podcast:episode>
      <itunes:title>InCoder-32B: Code Foundation Model for Industrial Scenarios</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c5f9f181-738e-46f1-a279-a8cee25dfc33</guid>
      <link>https://share.transistor.fm/s/5d8877ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv</p>

            <p><strong>Title:</strong><br>
            InCoder-32B: Code Foundation Model for Industrial Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16790v1">http://arxiv.org/abs/2603.16790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv</p>

            <p><strong>Title:</strong><br>
            InCoder-32B: Code Foundation Model for Industrial Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16790v1">http://arxiv.org/abs/2603.16790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:23:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5d8877ff/f34856a3.mp3" length="22400960" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Shawn Guo, Haowen Wang, Weicheng Gu, Yaxin Du, Joseph Li, Fanglin Xu, Yizhi Li, Lin Jing, Yuanbo Wang, Yuhan Gao, Ruihao Gong, Chuan Hao, Ran Tao, Aishan Liu, Tuney Zheng, Ganqu Cui, Zhoujun Li, Mingjie Tang, Chenghua Lin, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv</p>

            <p><strong>Title:</strong><br>
            InCoder-32B: Code Foundation Model for Industrial Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16790v1">http://arxiv.org/abs/2603.16790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qianfan-OCR: A Unified End-to-End Model for Document Intelligence</title>
      <itunes:episode>1637</itunes:episode>
      <podcast:episode>1637</podcast:episode>
      <itunes:title>Qianfan-OCR: A Unified End-to-End Model for Document Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4dc92401-460b-4b79-83e2-5e9f45b8afc3</guid>
      <link>https://share.transistor.fm/s/a4e5d788</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen</p>

            <p><strong>Title:</strong><br>
            Qianfan-OCR: A Unified End-to-End Model for Document Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13398v1">http://arxiv.org/abs/2603.13398v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen</p>

            <p><strong>Title:</strong><br>
            Qianfan-OCR: A Unified End-to-End Model for Document Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13398v1">http://arxiv.org/abs/2603.13398v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:22:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a4e5d788/5f85340f.mp3" length="20698619" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen</p>

            <p><strong>Title:</strong><br>
            Qianfan-OCR: A Unified End-to-End Model for Document Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13398v1">http://arxiv.org/abs/2603.13398v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding</title>
      <itunes:episode>1636</itunes:episode>
      <podcast:episode>1636</podcast:episode>
      <itunes:title>Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a6e746c-98b3-4c7e-ad48-0c22850e4518</guid>
      <link>https://share.transistor.fm/s/22b020ca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge</p>

            <p><strong>Title:</strong><br>
            Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13366v1">http://arxiv.org/abs/2603.13366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge</p>

            <p><strong>Title:</strong><br>
            Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13366v1">http://arxiv.org/abs/2603.13366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:22:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/22b020ca/4606e401.mp3" length="19537139" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1217</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge</p>

            <p><strong>Title:</strong><br>
            Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13366v1">http://arxiv.org/abs/2603.13366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation</title>
      <itunes:episode>1635</itunes:episode>
      <podcast:episode>1635</podcast:episode>
      <itunes:title>Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5366a2c7-0267-401d-8a8b-7743ae0af44f</guid>
      <link>https://share.transistor.fm/s/2eaa8826</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16669v1">http://arxiv.org/abs/2603.16669v1</a></p>

            <p><strong>Abstract:</strong><br>
            Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16669v1">http://arxiv.org/abs/2603.16669v1</a></p>

            <p><strong>Abstract:</strong><br>
            Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:22:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2eaa8826/64ecd814.mp3" length="22425219" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1398</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16669v1">http://arxiv.org/abs/2603.16669v1</a></p>

            <p><strong>Abstract:</strong><br>
            Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Demystifing Video Reasoning</title>
      <itunes:episode>1634</itunes:episode>
      <podcast:episode>1634</podcast:episode>
      <itunes:title>Demystifing Video Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a7da2b61-4924-46e5-98ee-7b0504f6a3a2</guid>
      <link>https://share.transistor.fm/s/de472eee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Demystifing Video Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16870v1">http://arxiv.org/abs/2603.16870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Demystifying Video Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16870v1">http://arxiv.org/abs/2603.16870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:21:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de472eee/03bfee4a.mp3" length="19583466" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1220</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Demystifying Video Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16870v1">http://arxiv.org/abs/2603.16870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation</title>
      <itunes:episode>1633</itunes:episode>
      <podcast:episode>1633</podcast:episode>
      <itunes:title>WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e4179e95-bca1-4a9d-9f44-8f43300eb2ce</guid>
      <link>https://share.transistor.fm/s/fc6aeddf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou</p>

            <p><strong>Title:</strong><br>
            WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16871v1">http://arxiv.org/abs/2603.16871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou</p>

            <p><strong>Title:</strong><br>
            WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16871v1">http://arxiv.org/abs/2603.16871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:21:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fc6aeddf/710fed91.mp3" length="20847456" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou</p>

            <p><strong>Title:</strong><br>
            WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16871v1">http://arxiv.org/abs/2603.16871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas</title>
      <itunes:episode>1632</itunes:episode>
      <podcast:episode>1632</podcast:episode>
      <itunes:title>TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ddee0c9d-ce1c-456e-a785-674282038e09</guid>
      <link>https://share.transistor.fm/s/0e0eef58</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16448v2">http://arxiv.org/abs/2603.16448v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16448v2">http://arxiv.org/abs/2603.16448v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:21:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0e0eef58/930fc6c2.mp3" length="22914670" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Xiaoyun Zhang, Wanrou Du, Jingqing Ruan, Jiangbo Pei, Weipeng Zhang, Ke Zeng, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16448v2">http://arxiv.org/abs/2603.16448v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-SQL parsing has achieved remarkable progress under the Full Schema Assumption. However, this premise fails in real-world enterprise environments where databases contain hundreds of tables with massive noisy metadata. Rather than injecting the full schema upfront, an agent must actively identify and verify only the relevant subset, giving rise to the Unknown Schema scenario we study in this work. To address this, we propose TRUST-SQL (Truthful Reasoning with Unknown Schema via Tools). We formulate the task as a Partially Observable Markov Decision Process where our autonomous agent employs a structured four-phase protocol to ground reasoning in verified metadata. Crucially, this protocol provides a structural boundary for our novel Dual-Track GRPO strategy. By applying token-level masked advantages, this strategy isolates exploration rewards from execution outcomes to resolve credit assignment, yielding a 9.9% relative improvement over standard GRPO. Extensive experiments across five benchmarks demonstrate that TRUST-SQL achieves an average absolute improvement of 30.6% and 16.6% for the 4B and 8B variants respectively over their base models. Remarkably, despite operating entirely without pre-loaded metadata, our framework consistently matches or surpasses strong baselines that rely on schema prefilling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Online Experiential Learning for Language Models</title>
      <itunes:episode>1631</itunes:episode>
      <podcast:episode>1631</podcast:episode>
      <itunes:title>Online Experiential Learning for Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ccb30f7e-d374-4f0e-80c1-99e98d39a839</guid>
      <link>https://share.transistor.fm/s/60fd514b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Online Experiential Learning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16856v1">http://arxiv.org/abs/2603.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Online Experiential Learning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16856v1">http://arxiv.org/abs/2603.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:20:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/60fd514b/0acf0646.mp3" length="24687604" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1539</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Online Experiential Learning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.16856v1">http://arxiv.org/abs/2603.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use</title>
      <itunes:episode>1630</itunes:episode>
      <podcast:episode>1630</podcast:episode>
      <itunes:title>FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">77ecf4a6-d71b-4597-9254-c33b6eb739ea</guid>
      <link>https://share.transistor.fm/s/7b34dfdf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08262v1">http://arxiv.org/abs/2603.08262v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08262v1">http://arxiv.org/abs/2603.08262v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Mar 2026 21:20:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7b34dfdf/955c4e85.mp3" length="23978766" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1495</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08262v1">http://arxiv.org/abs/2603.08262v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AI Can Learn Scientific Taste</title>
      <itunes:episode>1629</itunes:episode>
      <podcast:episode>1629</podcast:episode>
      <itunes:title>AI Can Learn Scientific Taste</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b96d111a-cf62-453a-bd28-627153d32c01</guid>
      <link>https://share.transistor.fm/s/e9537cfe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 215 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            AI Can Learn Scientific Taste</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14473v1">http://arxiv.org/abs/2603.14473v1</a></p>

            <p><strong>Abstract:</strong><br>
            Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 215 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            AI Can Learn Scientific Taste</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14473v1">http://arxiv.org/abs/2603.14473v1</a></p>

            <p><strong>Abstract:</strong><br>
            Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:16:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9537cfe/d918cafb.mp3" length="21669920" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 215 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            AI Can Learn Scientific Taste</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.14473v1">http://arxiv.org/abs/2603.14473v1</a></p>

            <p><strong>Abstract:</strong><br>
            Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data</title>
      <itunes:episode>1628</itunes:episode>
      <podcast:episode>1628</podcast:episode>
      <itunes:title>OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">57bac4ba-901c-4519-9c99-a78c8777672f</guid>
      <link>https://share.transistor.fm/s/593b9e97</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen</p>

            <p><strong>Title:</strong><br>
            OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15594v1">http://arxiv.org/abs/2603.15594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen</p>

            <p><strong>Title:</strong><br>
            OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15594v1">http://arxiv.org/abs/2603.15594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:15:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/593b9e97/854bc872.mp3" length="22738697" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen</p>

            <p><strong>Title:</strong><br>
            OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15594v1">http://arxiv.org/abs/2603.15594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) Fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity. (2) Denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby promoting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained (a single training run) on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</title>
      <itunes:episode>1627</itunes:episode>
      <podcast:episode>1627</podcast:episode>
      <itunes:title>EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d69dd8b2-ada9-4085-9bcd-ecbcbea53c90</guid>
      <link>https://share.transistor.fm/s/4c3d7c93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13594v1">http://arxiv.org/abs/2603.13594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13594v1">http://arxiv.org/abs/2603.13594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:15:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4c3d7c93/f108c6df.mp3" length="22998696" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1434</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 118 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13594v1">http://arxiv.org/abs/2603.13594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise is stalled by benchmarks that fail to capture the intricacies of professional environments, specifically, the need for long-horizon planning amidst persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools to mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art models: the top-performing Claude Opus 4.5 achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (best model achieves 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed to advance the robustness of agentic planning in professional workflows.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Grounding World Simulation Models in a Real-World Metropolis</title>
      <itunes:episode>1626</itunes:episode>
      <podcast:episode>1626</podcast:episode>
      <itunes:title>Grounding World Simulation Models in a Real-World Metropolis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0dc28091-33aa-4f0b-b3a0-7dc83006fa95</guid>
      <link>https://share.transistor.fm/s/1af9ba40</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Grounding World Simulation Models in a Real-World Metropolis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15583v1">http://arxiv.org/abs/2603.15583v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, as well as limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Grounding World Simulation Models in a Real-World Metropolis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15583v1">http://arxiv.org/abs/2603.15583v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, as well as limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:15:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1af9ba40/51eb0ad5.mp3" length="22977327" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1432</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Grounding World Simulation Models in a Real-World Metropolis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15583v1">http://arxiv.org/abs/2603.15583v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, as well as limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions</title>
      <itunes:episode>1625</itunes:episode>
      <podcast:episode>1625</podcast:episode>
      <itunes:title>HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05dd4afe-bea9-4c19-bd6f-a46a4448eded</guid>
      <link>https://share.transistor.fm/s/5e1bef30</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15612v1">http://arxiv.org/abs/2603.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15612v1">http://arxiv.org/abs/2603.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:14:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5e1bef30/f1a46e1f.mp3" length="22092955" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Haozhe Xie, Fangzhou Hong, Long Zhuo, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15612v1">http://arxiv.org/abs/2603.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Attention Residuals</title>
      <itunes:episode>1624</itunes:episode>
      <podcast:episode>1624</podcast:episode>
      <itunes:title>Attention Residuals</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">900ab195-acef-49f7-8948-de7cc5c8dd2f</guid>
      <link>https://share.transistor.fm/s/288c330d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou</p>

            <p><strong>Title:</strong><br>
            Attention Residuals</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15031v1">http://arxiv.org/abs/2603.15031v1</a></p>

            <p><strong>Abstract:</strong><br>
            Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.   Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou</p>

            <p><strong>Title:</strong><br>
            Attention Residuals</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15031v1">http://arxiv.org/abs/2603.15031v1</a></p>

            <p><strong>Abstract:</strong><br>
            Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.   Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:14:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/288c330d/97067fe5.mp3" length="22232483" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1386</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou</p>

            <p><strong>Title:</strong><br>
            Attention Residuals</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15031v1">http://arxiv.org/abs/2603.15031v1</a></p>

            <p><strong>Abstract:</strong><br>
            Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead.   Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mixture-of-Depths Attention</title>
      <itunes:episode>1623</itunes:episode>
      <podcast:episode>1623</podcast:episode>
      <itunes:title>Mixture-of-Depths Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cf2f89a9-cfa2-4df1-9181-fa13f07e276b</guid>
      <link>https://share.transistor.fm/s/5b21d3a8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Depths Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15619v1">http://arxiv.org/abs/2603.15619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Depths Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15619v1">http://arxiv.org/abs/2603.15619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:14:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5b21d3a8/64f898bf.mp3" length="22019332" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1373</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Depths Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15619v1">http://arxiv.org/abs/2603.15619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Effective Distillation to Hybrid xLSTM Architectures</title>
      <itunes:episode>1622</itunes:episode>
      <podcast:episode>1622</podcast:episode>
      <itunes:title>Effective Distillation to Hybrid xLSTM Architectures</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">320220b8-314a-4638-8b91-d298bae43e3d</guid>
      <link>https://share.transistor.fm/s/60af2667</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter</p>

            <p><strong>Title:</strong><br>
            Effective Distillation to Hybrid xLSTM Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15590v1">http://arxiv.org/abs/2603.15590v1</a></p>

            <p><strong>Abstract:</strong><br>
            There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter</p>

            <p><strong>Title:</strong><br>
            Effective Distillation to Hybrid xLSTM Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15590v1">http://arxiv.org/abs/2603.15590v1</a></p>

            <p><strong>Abstract:</strong><br>
            There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:13:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/60af2667/f40ca2c2.mp3" length="24522514" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1529</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter</p>

            <p><strong>Title:</strong><br>
            Effective Distillation to Hybrid xLSTM Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15590v1">http://arxiv.org/abs/2603.15590v1</a></p>

            <p><strong>Abstract:</strong><br>
            There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models</title>
      <itunes:episode>1621</itunes:episode>
      <podcast:episode>1621</podcast:episode>
      <itunes:title>Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">045b2105-618b-4bfb-8c9a-b26adf97d631</guid>
      <link>https://share.transistor.fm/s/030ba72f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15557v1">http://arxiv.org/abs/2603.15557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15557v1">http://arxiv.org/abs/2603.15557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:13:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/030ba72f/2f2aae13.mp3" length="22375511" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15557v1">http://arxiv.org/abs/2603.15557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer</title>
      <itunes:episode>1620</itunes:episode>
      <podcast:episode>1620</podcast:episode>
      <itunes:title>ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64494c6c-ca13-4778-9285-7a6243dfa4c2</guid>
      <link>https://share.transistor.fm/s/25621ff4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15478v1">http://arxiv.org/abs/2603.15478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results for controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15478v1">http://arxiv.org/abs/2603.15478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results for controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Mar 2026 21:13:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/25621ff4/ba88c84f.mp3" length="22390516" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.15478v1">http://arxiv.org/abs/2603.15478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to their image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results for controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LMEB: Long-horizon Memory Embedding Benchmark</title>
      <itunes:episode>1619</itunes:episode>
      <podcast:episode>1619</podcast:episode>
      <itunes:title>LMEB: Long-horizon Memory Embedding Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f4051331-a51a-4a0f-8ec5-7baf4faae679</guid>
      <link>https://share.transistor.fm/s/d8b8f28a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            LMEB: Long-horizon Memory Embedding Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12572v1">http://arxiv.org/abs/2603.12572v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            LMEB: Long-horizon Memory Embedding Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12572v1">http://arxiv.org/abs/2603.12572v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Mar 2026 20:39:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8b8f28a/2668b840.mp3" length="20967346" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            LMEB: Long-horizon Memory Embedding Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12572v1">http://arxiv.org/abs/2603.12572v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can Vision-Language Models Solve the Shell Game?</title>
      <itunes:episode>1618</itunes:episode>
      <podcast:episode>1618</podcast:episode>
      <itunes:title>Can Vision-Language Models Solve the Shell Game?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b3984aa-46d9-457b-894e-49ad7d81a8a5</guid>
      <link>https://share.transistor.fm/s/e59cf854</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiedong Liu, Wee Sun Lee</p>

            <p><strong>Title:</strong><br>
            Can Vision-Language Models Solve the Shell Game?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08436v1">http://arxiv.org/abs/2603.08436v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiedong Liu, Wee Sun Lee</p>

            <p><strong>Title:</strong><br>
            Can Vision-Language Models Solve the Shell Game?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08436v1">http://arxiv.org/abs/2603.08436v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Mar 2026 20:38:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e59cf854/d02f35e0.mp3" length="21668267" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiedong Liu, Wee Sun Lee</p>

            <p><strong>Title:</strong><br>
            Can Vision-Language Models Solve the Shell Game?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.08436v1">http://arxiv.org/abs/2603.08436v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation</title>
      <itunes:episode>1617</itunes:episode>
      <podcast:episode>1617</podcast:episode>
      <itunes:title>Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3c7918e3-9ed6-4458-8944-9296df6e3292</guid>
      <link>https://share.transistor.fm/s/8c4da0fb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12793v1">http://arxiv.org/abs/2603.12793v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12793v1">http://arxiv.org/abs/2603.12793v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Mar 2026 20:38:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8c4da0fb/32d7588e.mp3" length="20126904" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12793v1">http://arxiv.org/abs/2603.12793v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>daVinci-Env: Open SWE Environment Synthesis at Scale</title>
      <itunes:episode>1616</itunes:episode>
      <podcast:episode>1616</podcast:episode>
      <itunes:title>daVinci-Env: Open SWE Environment Synthesis at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">454309e9-fda4-427b-865b-16e6adf7966d</guid>
      <link>https://share.transistor.fm/s/79851bee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-Env: Open SWE Environment Synthesis at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13023v1">http://arxiv.org/abs/2603.13023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-Env: Open SWE Environment Synthesis at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13023v1">http://arxiv.org/abs/2603.13023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Mar 2026 20:37:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/79851bee/c5bfeb0f.mp3" length="19831340" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1236</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            daVinci-Env: Open SWE Environment Synthesis at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.13023v1">http://arxiv.org/abs/2603.13023v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections</title>
      <itunes:episode>1615</itunes:episode>
      <podcast:episode>1615</podcast:episode>
      <itunes:title>Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">829c7aaa-872d-4135-9078-dea599cffa77</guid>
      <link>https://share.transistor.fm/s/5bb70e14</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta</p>

            <p><strong>Title:</strong><br>
            Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12180v1">http://arxiv.org/abs/2603.12180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta</p>

            <p><strong>Title:</strong><br>
            Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12180v1">http://arxiv.org/abs/2603.12180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Mar 2026 20:19:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5bb70e14/2161d094.mp3" length="24512528" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1528</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta</p>

            <p><strong>Title:</strong><br>
            Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.12180v1">http://arxiv.org/abs/2603.12180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenClaw-RL: Train Any Agent Simply by Talking</title>
      <itunes:episode>1614</itunes:episode>
      <podcast:episode>1614</podcast:episode>
      <itunes:title>OpenClaw-RL: Train Any Agent Simply by Talking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b42ded6c-b835-4a58-b37f-6c83bbd8cf76</guid>
      <link>https://share.transistor.fm/s/9b3acaca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            OpenClaw-RL: Train Any Agent Simply by Talking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10165v1">http://arxiv.org/abs/2603.10165v1</a></p>

            <p><strong>Abstract:</strong><br>
            Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            OpenClaw-RL: Train Any Agent Simply by Talking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10165v1">http://arxiv.org/abs/2603.10165v1</a></p>

            <p><strong>Abstract:</strong><br>
            Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Mar 2026 22:45:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9b3acaca/c7bfa478.mp3" length="24782061" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1545</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            OpenClaw-RL: Train Any Agent Simply by Talking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10165v1">http://arxiv.org/abs/2603.10165v1</a></p>

            <p><strong>Abstract:</strong><br>
            Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Flash-KMeans: Fast and Memory-Efficient Exact K-Means</title>
      <itunes:episode>1613</itunes:episode>
      <podcast:episode>1613</podcast:episode>
      <itunes:title>Flash-KMeans: Fast and Memory-Efficient Exact K-Means</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8fd93e2d-bb27-4252-8c99-e4016a8bbefb</guid>
      <link>https://share.transistor.fm/s/6bda2092</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Flash-KMeans: Fast and Memory-Efficient Exact K-Means</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09229v1">http://arxiv.org/abs/2603.09229v1</a></p>

            <p><strong>Abstract:</strong><br>
            $k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Flash-KMeans: Fast and Memory-Efficient Exact K-Means</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09229v1">http://arxiv.org/abs/2603.09229v1</a></p>

            <p><strong>Abstract:</strong><br>
            $k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Mar 2026 22:44:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6bda2092/e03586d5.mp3" length="22336171" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1392</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Flash-KMeans: Fast and Memory-Efficient Exact K-Means</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09229v1">http://arxiv.org/abs/2603.09229v1</a></p>

            <p><strong>Abstract:</strong><br>
            $k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents</title>
      <itunes:episode>1612</itunes:episode>
      <podcast:episode>1612</podcast:episode>
      <itunes:title>MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">16c19b15-99a7-4afa-a92a-575b3ea0db71</guid>
      <link>https://share.transistor.fm/s/6d9be8dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09827v2">http://arxiv.org/abs/2603.09827v2</a></p>

            <p><strong>Abstract:</strong><br>
            As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09827v2">http://arxiv.org/abs/2603.09827v2</a></p>

            <p><strong>Abstract:</strong><br>
            As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Mar 2026 22:44:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6d9be8dc/0010b9e1.mp3" length="24773737" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1545</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.09827v2">http://arxiv.org/abs/2603.09827v2</a></p>

            <p><strong>Abstract:</strong><br>
            As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systematically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLM2Vec-Gen: Generative Embeddings from Large Language Models</title>
      <itunes:episode>1611</itunes:episode>
      <podcast:episode>1611</podcast:episode>
      <itunes:title>LLM2Vec-Gen: Generative Embeddings from Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58b17624-2f92-4ba3-baf8-2b6acaab09e5</guid>
      <link>https://share.transistor.fm/s/8489730f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            LLM2Vec-Gen: Generative Embeddings from Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10913v1">http://arxiv.org/abs/2603.10913v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            LLM2Vec-Gen: Generative Embeddings from Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10913v1">http://arxiv.org/abs/2603.10913v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Mar 2026 22:44:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8489730f/b9d41c54.mp3" length="23455473" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1462</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            LLM2Vec-Gen: Generative Embeddings from Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2603.10913v1">http://arxiv.org/abs/2603.10913v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Urban Socio-Semantic Segmentation with Vision-Language Reasoning</title>
      <itunes:episode>1610</itunes:episode>
      <podcast:episode>1610</podcast:episode>
      <itunes:title>Urban Socio-Semantic Segmentation with Vision-Language Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7141e96c-0bc5-4766-bc64-856920c8894b</guid>
      <link>https://share.transistor.fm/s/1c42ae68</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 139 | cs.CV, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li</p>

            <p><strong>Title:</strong><br>
            Urban Socio-Semantic Segmentation with Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.10477v1">http://arxiv.org/abs/2601.10477v1</a></p>

            <p><strong>Abstract:</strong><br>
            As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 139 | cs.CV, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li</p>

            <p><strong>Title:</strong><br>
            Urban Socio-Semantic Segmentation with Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.10477v1">http://arxiv.org/abs/2601.10477v1</a></p>

            <p><strong>Abstract:</strong><br>
            As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 Jan 2026 19:27:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1c42ae68/76cf8f67.mp3" length="20981994" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 139 | cs.CV, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li</p>

            <p><strong>Title:</strong><br>
            Urban Socio-Semantic Segmentation with Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.10477v1">http://arxiv.org/abs/2601.10477v1</a></p>

            <p><strong>Abstract:</strong><br>
            As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>STEP3-VL-10B Technical Report</title>
      <itunes:episode>1609</itunes:episode>
      <podcast:episode>1609</podcast:episode>
      <itunes:title>STEP3-VL-10B Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3361905c-1486-4707-875a-08e3a6fd0400</guid>
      <link>https://share.transistor.fm/s/3b4af257</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 130 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge</p>

            <p><strong>Title:</strong><br>
            STEP3-VL-10B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09668v2">http://arxiv.org/abs/2601.09668v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 130 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge</p>

            <p><strong>Title:</strong><br>
            STEP3-VL-10B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09668v2">http://arxiv.org/abs/2601.09668v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 Jan 2026 19:27:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3b4af257/850b5e9d.mp3" length="25453704" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1587</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 130 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge</p>

            <p><strong>Title:</strong><br>
            STEP3-VL-10B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09668v2">http://arxiv.org/abs/2601.09668v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</title>
      <itunes:episode>1608</itunes:episode>
      <podcast:episode>1608</podcast:episode>
      <itunes:title>Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ca2e48b6-6292-4795-8069-7c124383853b</guid>
      <link>https://share.transistor.fm/s/b24116aa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 111 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08763v2">http://arxiv.org/abs/2601.08763v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 111 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08763v2">http://arxiv.org/abs/2601.08763v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 Jan 2026 19:27:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b24116aa/868e6ac3.mp3" length="19772432" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 111 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08763v2">http://arxiv.org/abs/2601.08763v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning</title>
      <itunes:episode>1607</itunes:episode>
      <podcast:episode>1607</podcast:episode>
      <itunes:title>Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a81680e-46b1-4a20-9a27-efacb4046213</guid>
      <link>https://share.transistor.fm/s/1b62e538</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park</p>

            <p><strong>Title:</strong><br>
            Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09667v2">http://arxiv.org/abs/2601.09667v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce <strong>Multi-Agent Test-Time Reinforcement Learning (MATTRL)</strong>, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park</p>

            <p><strong>Title:</strong><br>
            Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09667v2">http://arxiv.org/abs/2601.09667v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce <strong>Multi-Agent Test-Time Reinforcement Learning (MATTRL)</strong>, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 Jan 2026 19:26:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b62e538/3e630083.mp3" length="24282626" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1514</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park</p>

            <p><strong>Title:</strong><br>
            Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09667v2">http://arxiv.org/abs/2601.09667v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce <strong>Multi-Agent Test-Time Reinforcement Learning (MATTRL)</strong>, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Controlled Self-Evolution for Algorithmic Code Optimization</title>
      <itunes:episode>1606</itunes:episode>
      <podcast:episode>1606</podcast:episode>
      <itunes:title>Controlled Self-Evolution for Algorithmic Code Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7f57c2e2-482e-4131-bb59-df4ffd4d61a3</guid>
      <link>https://share.transistor.fm/s/a6c41a47</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CL, cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu</p>

            <p><strong>Title:</strong><br>
            Controlled Self-Evolution for Algorithmic Code Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07348v4">http://arxiv.org/abs/2601.07348v4</a></p>

            <p><strong>Abstract:</strong><br>
            Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CL, cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu</p>

            <p><strong>Title:</strong><br>
            Controlled Self-Evolution for Algorithmic Code Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07348v4">http://arxiv.org/abs/2601.07348v4</a></p>

            <p><strong>Abstract:</strong><br>
            Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:45:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6c41a47/31b974ef.mp3" length="22773780" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CL, cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu</p>

            <p><strong>Title:</strong><br>
            Controlled Self-Evolution for Algorithmic Code Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07348v4">http://arxiv.org/abs/2601.07348v4</a></p>

            <p><strong>Abstract:</strong><br>
            Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation</title>
      <itunes:episode>1605</itunes:episode>
      <podcast:episode>1605</podcast:episode>
      <itunes:title>DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cd04a507-fbd4-4a37-b6d6-94e78b132a75</guid>
      <link>https://share.transistor.fm/s/e85931ab</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09688v1">http://arxiv.org/abs/2601.09688v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09688v1">http://arxiv.org/abs/2601.09688v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:45:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e85931ab/351a30ea.mp3" length="17841902" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1111</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09688v1">http://arxiv.org/abs/2601.09688v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MAXS: Meta-Adaptive Exploration with LLM Agents</title>
      <itunes:episode>1604</itunes:episode>
      <podcast:episode>1604</podcast:episode>
      <itunes:title>MAXS: Meta-Adaptive Exploration with LLM Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">02f25294-1dfa-45a9-b71e-a09acf315551</guid>
      <link>https://share.transistor.fm/s/b8ab774b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MAXS: Meta-Adaptive Exploration with LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09259v1">http://arxiv.org/abs/2601.09259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose Meta-Adaptive Exploration with LLM Agents (MAXS; code: https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MAXS: Meta-Adaptive Exploration with LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09259v1">http://arxiv.org/abs/2601.09259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose Meta-Adaptive Exploration with LLM Agents (MAXS; code: https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:45:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8ab774b/7c5aa590.mp3" length="20886682" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jian Zhang, Zhiyuan Wang, Zhangqi Wang, Yu He, Haoran Luo, li yuan, Lingling Zhang, Rui Mao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MAXS: Meta-Adaptive Exploration with LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09259v1">http://arxiv.org/abs/2601.09259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose Meta-Adaptive Exploration with LLM Agents (MAXS; code: https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning</title>
      <itunes:episode>1603</itunes:episode>
      <podcast:episode>1603</podcast:episode>
      <itunes:title>Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">481a0405-1f49-4151-8416-3cf6be129154</guid>
      <link>https://share.transistor.fm/s/cc095335</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye</p>

            <p><strong>Title:</strong><br>
            Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09088v1">http://arxiv.org/abs/2601.09088v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye</p>

            <p><strong>Title:</strong><br>
            Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09088v1">http://arxiv.org/abs/2601.09088v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:44:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cc095335/17d7277b.mp3" length="19817152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1235</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye</p>

            <p><strong>Title:</strong><br>
            Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09088v1">http://arxiv.org/abs/2601.09088v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning</title>
      <itunes:episode>1602</itunes:episode>
      <podcast:episode>1602</podcast:episode>
      <itunes:title>Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f3536d7-3fe7-4bdd-8155-5ad8a1d24397</guid>
      <link>https://share.transistor.fm/s/f61d0404</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang</p>

            <p><strong>Title:</strong><br>
            Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09708v1">http://arxiv.org/abs/2601.09708v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang</p>

            <p><strong>Title:</strong><br>
            Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09708v1">http://arxiv.org/abs/2601.09708v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.</p>
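
            <p>As a rough illustration of a preference-guided trajectory-alignment signal of the kind the abstract mentions, the sketch below scores two candidate manipulation trajectories by their distance to a teacher trajectory and turns that into a pairwise (Bradley-Terry-style) preference log-likelihood. The distance measure, the pairwise form, and all coordinates are assumptions for illustration, not the paper's objective.</p>

            <pre><code># Toy preference-guided alignment: prefer the candidate trajectory that is
# closer to the teacher's, expressed as a Bradley-Terry log-likelihood.
import math

def traj_distance(a, b):
    # mean squared distance between two 2-D trajectories of equal length
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)

teacher        = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.4)]
candidate_good = [(0.0, 0.0), (0.45, 0.25), (0.95, 0.45)]
candidate_bad  = [(0.0, 0.1), (0.2, 0.6), (0.4, 1.0)]

s_good = -traj_distance(teacher, candidate_good)   # higher = closer to teacher
s_bad  = -traj_distance(teacher, candidate_bad)

# log-probability of preferring the good candidate over the bad one;
# a policy would be trained to maximize this quantity
pref_loglik = -math.log(1.0 + math.exp(s_bad - s_good))
print("scores:", round(s_good, 4), round(s_bad, 4))
print("pairwise preference log-likelihood:", round(pref_loglik, 4))
</code></pre>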
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:44:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f61d0404/efdd73e6.mp3" length="23767718" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1482</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang</p>

            <p><strong>Title:</strong><br>
            Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09708v1">http://arxiv.org/abs/2601.09708v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL</title>
      <itunes:episode>1601</itunes:episode>
      <podcast:episode>1601</podcast:episode>
      <itunes:title>SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d9d9cb90-b776-41f9-8bb1-8981624a67fa</guid>
      <link>https://share.transistor.fm/s/b00e2f1a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou</p>

            <p><strong>Title:</strong><br>
            SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09136v1">http://arxiv.org/abs/2601.09136v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou</p>

            <p><strong>Title:</strong><br>
            SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09136v1">http://arxiv.org/abs/2601.09136v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.</p>
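
            <p>To make the staged training idea concrete, here is a toy reward schedule in the spirit of Stage I (align explicit medical descriptions) and Stage II (reward the diagnosis with some credit for the right coarse family). The token-overlap proxy, the two-level label hierarchy, and the reward values are illustrative assumptions, not the paper's implementation.</p>

            <pre><code># Toy two-stage reward: Stage I scores description alignment only,
# Stage II scores the diagnosis with partial credit for the coarse family.
def overlap(pred_tokens, ref_tokens):
    ref = set(ref_tokens)
    hits = sum(1 for t in pred_tokens if t in ref)
    return hits / max(1, len(ref))

HIERARCHY = {"melanoma": "malignant", "bcc": "malignant", "nevus": "benign"}

def reward(stage, pred_description, ref_description, pred_dx, gold_dx):
    if stage == 1:
        return overlap(pred_description, ref_description)
    if pred_dx == gold_dx:
        return 1.0
    if HIERARCHY.get(pred_dx) == HIERARCHY.get(gold_dx):
        return 0.5          # toy stand-in for "hierarchical relevance"
    return 0.0

print(reward(1, ["irregular", "border", "pigmented"],
             ["irregular", "pigmented", "lesion"], None, None))
print(reward(2, [], [], "bcc", "melanoma"))
</code></pre>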
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:43:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b00e2f1a/1dc26143.mp3" length="25916476" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1616</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou</p>

            <p><strong>Title:</strong><br>
            SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09136v1">http://arxiv.org/abs/2601.09136v1</a></p>

            <p><strong>Abstract:</strong><br>
            General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG</title>
      <itunes:episode>1600</itunes:episode>
      <podcast:episode>1600</podcast:episode>
      <itunes:title>OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d225c86d-cf7e-45b2-a93d-bbacd1b54519</guid>
      <link>https://share.transistor.fm/s/37eee6db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie</p>

            <p><strong>Title:</strong><br>
            OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09028v1">http://arxiv.org/abs/2601.09028v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has led to superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and on the capacity of the LLM's internal information-processing mechanism to incorporate it into answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is therefore important to take the relevance of the retrieved information into account during answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. Experimental results on five benchmark datasets demonstrate the effectiveness and improved robustness of OpenDecoder, which outperforms various baseline methods. Importantly, this paradigm is flexible: it can be integrated into the post-training of LLMs for any purpose and combined with any type of external indicator.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie</p>

            <p><strong>Title:</strong><br>
            OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09028v1">http://arxiv.org/abs/2601.09028v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has led to superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and on the capacity of the LLM's internal information-processing mechanism to incorporate it into answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is therefore important to take the relevance of the retrieved information into account during answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. Experimental results on five benchmark datasets demonstrate the effectiveness and improved robustness of OpenDecoder, which outperforms various baseline methods. Importantly, this paradigm is flexible: it can be integrated into the post-training of LLMs for any purpose and combined with any type of external indicator.</p>
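
            <p>The paper feeds the three indicator types into decoding and post-training; as a simpler, prompt-level analogue, the sketch below merely attaches a relevance score, rank, and QPP score to each retrieved passage so that a generator can condition on them. The template, field names, and example passages are assumptions for illustration only.</p>

            <pre><code># Attach explicit retrieval-quality indicators to each passage before
# handing the context to a generator (prompt-level analogue of the idea).
def build_prompt(question, retrieved):
    lines = ["Question: " + question, "",
             "Retrieved passages with quality indicators:"]
    for i, doc in enumerate(retrieved, start=1):
        lines.append(
            "[rank={rank} relevance={rel:.2f} qpp={qpp:.2f}] {text}".format(
                rank=i, rel=doc["relevance"], qpp=doc["qpp"], text=doc["text"]))
    lines.append("")
    lines.append("Answer the question, relying more on higher-scored passages.")
    return "\n".join(lines)

retrieved = [
    {"text": "The Eiffel Tower is 330 metres tall.", "relevance": 0.92, "qpp": 0.71},
    {"text": "Paris hosts many landmarks.", "relevance": 0.41, "qpp": 0.33},
]
print(build_prompt("How tall is the Eiffel Tower?", retrieved))
</code></pre>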
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:43:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/37eee6db/b22284f6.mp3" length="24507502" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1528</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie</p>

            <p><strong>Title:</strong><br>
            OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09028v1">http://arxiv.org/abs/2601.09028v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has led to superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and on the capacity of the LLM's internal information-processing mechanism to incorporate it into answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is therefore important to take the relevance of the retrieved information into account during answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. Experimental results on five benchmark datasets demonstrate the effectiveness and improved robustness of OpenDecoder, which outperforms various baseline methods. Importantly, this paradigm is flexible: it can be integrated into the post-training of LLMs for any purpose and combined with any type of external indicator.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding</title>
      <itunes:episode>1599</itunes:episode>
      <podcast:episode>1599</podcast:episode>
      <itunes:title>OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3d8fbec6-6a2a-47db-9606-e4f74d4e7f93</guid>
      <link>https://share.transistor.fm/s/bcf6eb0b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun</p>

            <p><strong>Title:</strong><br>
            OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09575v1">http://arxiv.org/abs/2601.09575v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun</p>

            <p><strong>Title:</strong><br>
            OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09575v1">http://arxiv.org/abs/2601.09575v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.</p>
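
            <p>A minimal sketch of the "caption each voxel group, then answer queries by text-to-text search" idea: caption_group() below is a hypothetical stand-in for an MLLM call, and the keyword-overlap matcher stands in for LLM-based text-to-text search; neither is the paper's actual component.</p>

            <pre><code># Build a tiny "scene map" of group captions, then answer an open-vocabulary
# query by matching query text against the captions (text-to-text search).
def caption_group(group_id):
    # hypothetical stub for an MLLM captioning a rendered voxel group
    return {
        0: "a red armchair next to the window",
        1: "a wooden coffee table with books on it",
        2: "a floor lamp in the corner",
    }[group_id]

def text_match_score(query, caption):
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q.intersection(c)) / max(1, len(q))

scene_map = {gid: caption_group(gid) for gid in range(3)}

query = "the chair near the window"
best = max(scene_map, key=lambda gid: text_match_score(query, scene_map[gid]))
print("query:", query)
print("best matching voxel group:", best, "->", scene_map[best])
</code></pre>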
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 Jan 2026 19:43:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bcf6eb0b/73a0bda1.mp3" length="22457842" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1400</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun</p>

            <p><strong>Title:</strong><br>
            OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.09575v1">http://arxiv.org/abs/2601.09575v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), our OpenVoxel successfully builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences</title>
      <itunes:episode>1598</itunes:episode>
      <podcast:episode>1598</podcast:episode>
      <itunes:title>MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">10fed0a5-997b-4036-a55b-11edd1038686</guid>
      <link>https://share.transistor.fm/s/ca69ab2d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, Yue Hu, Shaolei Zhang, Yanbing Liu, Ronghao Chen, Huacan Wang</p>

            <p><strong>Title:</strong><br>
            MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06789v2">http://arxiv.org/abs/2601.06789v2</a></p>

            <p><strong>Abstract:</strong><br>
            While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a "closed-world" limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, Yue Hu, Shaolei Zhang, Yanbing Liu, Ronghao Chen, Huacan Wang</p>

            <p><strong>Title:</strong><br>
            MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06789v2">http://arxiv.org/abs/2601.06789v2</a></p>

            <p><strong>Abstract:</strong><br>
            While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a "closed-world" limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.</p>
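
            <p>To make the notion of a governed "experience card" concrete, here is a sketch of one possible card record plus a naive keyword-overlap retrieval step. The field names and the retrieval rule are illustrative assumptions; the paper's agentic experience search is more elaborate than this.</p>

            <pre><code># Hypothetical experience-card schema plus a toy retrieval step.
from dataclasses import dataclass, field

@dataclass
class ExperienceCard:
    repo: str
    symptom: str          # what the issue reported
    root_cause: str       # what was actually wrong
    fix_summary: str      # how it was resolved
    tags: list = field(default_factory=list)

CARDS = [
    ExperienceCard("libfoo", "timeout when parsing large files",
                   "quadratic regex backtracking", "rewrite regex, add size cap",
                   ["regex", "timeout", "performance"]),
    ExperienceCard("libbar", "crash on empty config",
                   "missing null check in loader", "guard before dereference",
                   ["crash", "config"]),
]

def retrieve(query, cards, k=1):
    q = set(query.lower().split())
    def score(card):
        terms = set(card.symptom.lower().split()) | set(card.tags)
        return len(q.intersection(terms))
    return sorted(cards, key=score, reverse=True)[:k]

for card in retrieve("regex timeout parsing huge file", CARDS):
    print(card.repo, "->", card.fix_summary)
</code></pre>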
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:57:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca69ab2d/76a3d81d.mp3" length="22536401" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1405</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qihao Wang, Ziming Cheng, Shuo Zhang, Fan Liu, Rui Xu, Heng Lian, Kunyi Wang, Xiaoming Yu, Jianghao Yin, Sen Hu, Yue Hu, Shaolei Zhang, Yanbing Liu, Ronghao Chen, Huacan Wang</p>

            <p><strong>Title:</strong><br>
            MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06789v2">http://arxiv.org/abs/2601.06789v2</a></p>

            <p><strong>Abstract:</strong><br>
            While autonomous software engineering (SWE) agents are reshaping programming paradigms, they currently suffer from a "closed-world" limitation: they attempt to fix bugs from scratch or solely using local context, ignoring the immense historical human experience available on platforms like GitHub. Accessing this open-world experience is hindered by the unstructured and fragmented nature of real-world issue-tracking data. In this paper, we introduce MemGovern, a framework designed to govern and transform raw GitHub data into actionable experiential memory for agents. MemGovern employs experience governance to convert human experience into agent-friendly experience cards and introduces an agentic experience search strategy that enables logic-driven retrieval of human expertise. By producing 135K governed experience cards, MemGovern achieves a significant performance boost, improving resolution rates on SWE-bench Verified by 4.65%. As a plug-in approach, MemGovern provides a solution for agent-friendly memory infrastructure.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Solar Open Technical Report</title>
      <itunes:episode>1597</itunes:episode>
      <podcast:episode>1597</podcast:episode>
      <itunes:title>Solar Open Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0f5c90f4-91e2-4dbf-92f6-e3c7813b8291</guid>
      <link>https://share.transistor.fm/s/42e5925e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh</p>

            <p><strong>Title:</strong><br>
            Solar Open Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07022v1">http://arxiv.org/abs/2601.07022v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh</p>

            <p><strong>Title:</strong><br>
            Solar Open Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07022v1">http://arxiv.org/abs/2601.07022v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:57:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/42e5925e/799e4ccb.mp3" length="20316984" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1266</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sungrae Park, Sanghoon Kim, Jungho Cho, Gyoungjin Gim, Dawoon Jung, Mikyoung Cha, Eunhae Choo, Taekgyu Hong, Minbyul Jeong, SeHwan Joo, Minsoo Khang, Eunwon Kim, Minjeong Kim, Sujeong Kim, Yunsu Kim, Hyeonju Lee, Seunghyun Lee, Sukyung Lee, Siyoung Park, Gyungin Shin, Inseo Song, Wonho Song, Seonghoon Yang, Seungyoun Yi, Sanghoon Yoon, Jeonghyun Ko, Seyoung Song, Keunwoo Choi, Hwalsuk Lee, Sunghun Kim, Du-Seong Chang, Kyunghyun Cho, Junsuk Choe, Hwaran Lee, Jae-Gil Lee, KyungTae Lim, Alice Oh</p>

            <p><strong>Title:</strong><br>
            Solar Open Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07022v1">http://arxiv.org/abs/2601.07022v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Solar Open, a 102B-parameter bilingual Mixture-of-Experts language model for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite data scarcity for underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum jointly optimizing composition, quality thresholds, and domain coverage across 20 trillion tokens. Third, to enable reasoning capabilities through scalable RL, we apply our proposed framework SnapPO for efficient optimization. Across benchmarks in English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved language AI development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions</title>
      <itunes:episode>1596</itunes:episode>
      <podcast:episode>1596</podcast:episode>
      <itunes:title>KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0cef4b8d-16f3-402e-a8ff-0a3f64686248</guid>
      <link>https://share.transistor.fm/s/00f56c8a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04745v1">http://arxiv.org/abs/2601.04745v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04745v1">http://arxiv.org/abs/2601.04745v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.</p>
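
            <p>As an illustration of what a time-anchored, evidence-linked evaluation record could look like, the sketch below defines one possible item schema covering the three question types named in the abstract. The field names and example content are assumptions, not the released data format.</p>

            <pre><code># Hypothetical item schema for a time-anchored, evidence-linked question.
from dataclasses import dataclass

@dataclass
class KnowMeItem:
    question: str
    question_type: str        # "factual" | "subjective_state" | "principle"
    time_anchor: str          # where in the narrative the evidence occurs
    evidence_span: str        # passage the answer must be grounded in
    reference_answer: str

item = KnowMeItem(
    question="Why did she decline the promotion in 2014?",
    question_type="principle",
    time_anchor="2014, chapter on the city years",
    evidence_span=("I turned it down; I had promised myself never to trade "
                   "time with my daughter for a title again."),
    reference_answer=("She prioritizes family time over career status, a "
                      "principle formed after an earlier regret."),
)
print(item.question_type, "|", item.time_anchor)
</code></pre>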
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:56:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/00f56c8a/f2ff0c2e.mp3" length="21277088" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen</p>

            <p><strong>Title:</strong><br>
            KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04745v1">http://arxiv.org/abs/2601.04745v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is available at https://github.com/QuantaAlpha/KnowMeBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale</title>
      <itunes:episode>1595</itunes:episode>
      <podcast:episode>1595</podcast:episode>
      <itunes:title>User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9379dc82-875e-4855-a60c-3cb78d30440a</guid>
      <link>https://share.transistor.fm/s/65e119dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jungho Cho, Minbyul Jeong, Sungrae Park</p>

            <p><strong>Title:</strong><br>
            User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08225v1">http://arxiv.org/abs/2601.08225v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jungho Cho, Minbyul Jeong, Sungrae Park</p>

            <p><strong>Title:</strong><br>
            User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08225v1">http://arxiv.org/abs/2601.08225v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.</p>
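
            <p>Below is a toy version of the user-oriented simulation loop the abstract describes: a user simulator reveals sub-requests incrementally and reacts turn by turn, while a stubbed agent responds. The simulator rules and the stub are illustrative assumptions, not the paper's generation pipeline.</p>

            <pre><code># Toy user-oriented dialogue generation loop: incremental requests plus
# turn-by-turn feedback, instead of stating the whole task up front.
class UserSimulator:
    def __init__(self, sub_requests):
        self.sub_requests = list(sub_requests)
        self.turn = 0

    def next_message(self, last_agent_reply):
        if self.turn >= len(self.sub_requests):
            return None                                   # task finished
        if self.turn == 0:
            msg = self.sub_requests[0]                    # opening request
        else:
            msg = "Thanks. Next: " + self.sub_requests[self.turn]
        self.turn += 1
        return msg

def stub_agent(user_msg):
    return "Done: " + user_msg       # placeholder for a tool-using agent

sim = UserSimulator([
    "Find flights from Seoul to Tokyo next Friday.",
    "Filter to morning departures only.",
    "Book the cheapest one and email me the receipt.",
])

reply = ""
while True:
    user_msg = sim.next_message(reply)
    if user_msg is None:
        break
    reply = stub_agent(user_msg)
    print("USER :", user_msg)
    print("AGENT:", reply)
</code></pre>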
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:56:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65e119dd/c722754c.mp3" length="22553106" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1406</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jungho Cho, Minbyul Jeong, Sungrae Park</p>

            <p><strong>Title:</strong><br>
            User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08225v1">http://arxiv.org/abs/2601.08225v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in "solely task-solving" trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands</title>
      <itunes:episode>1594</itunes:episode>
      <podcast:episode>1594</podcast:episode>
      <itunes:title>ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">02fb95b2-999f-49a0-93d6-cfb53fabcfdb</guid>
      <link>https://share.transistor.fm/s/8546840e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24965v1">http://arxiv.org/abs/2512.24965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24965v1">http://arxiv.org/abs/2512.24965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.</p>
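
            <p><strong>Illustrative sketch (editor's addition, not from the paper):</strong><br>
            The closed-loop drag behavior described in (ii) can be pictured with the toy loop below: at every step a stand-in "action expert" emits a small cursor delta, and the loop re-checks the state before the next adjustment. The real model predicts deltas from screen observations with a flow-based policy; the straight-line expert, step size, and tolerance here are placeholder assumptions.</p>

            <pre><code># Toy closed-loop drag control with incremental cursor deltas.
import math

def toy_action_expert(cursor, target, max_step=5.0):
    """Stand-in for the learned action expert: returns a small (dx, dy)."""
    dx, dy = target[0] - cursor[0], target[1] - cursor[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return dx, dy
    scale = max_step / dist
    return dx * scale, dy * scale

def closed_loop_drag(start, target, max_iters=200, tol=0.5):
    """Iteratively apply incremental adjustments until the drag converges."""
    cursor = list(start)
    trajectory = [tuple(cursor)]
    for _ in range(max_iters):
        dx, dy = toy_action_expert(cursor, target)
        cursor[0] += dx
        cursor[1] += dy
        trajectory.append(tuple(cursor))
        if math.hypot(target[0] - cursor[0], target[1] - cursor[1]) < tol:
            break
    return trajectory

if __name__ == "__main__":
    path = closed_loop_drag(start=(10.0, 10.0), target=(120.0, 40.0))
    print(f"converged in {len(path) - 1} steps, final position {path[-1]}")
</code></pre>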
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:56:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8546840e/b06f6e80.mp3" length="21506179" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24965v1">http://arxiv.org/abs/2512.24965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking</title>
      <itunes:episode>1593</itunes:episode>
      <podcast:episode>1593</podcast:episode>
      <itunes:title>ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7b760cb3-78c6-42ef-91e0-14dff414713c</guid>
      <link>https://share.transistor.fm/s/9031021c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha</p>

            <p><strong>Title:</strong><br>
            ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06487v1">http://arxiv.org/abs/2601.06487v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to full O(N^2) pairwise comparison while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha</p>

            <p><strong>Title:</strong><br>
            ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06487v1">http://arxiv.org/abs/2601.06487v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to full O(N^2) pairwise comparison while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.</p>
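
            <p><strong>Illustrative sketch (editor's addition, not the authors' code):</strong><br>
            As a rough illustration of the tournament idea, the snippet below ranks a group of N trajectories with a seeded single-elimination bracket, spending only N-1 pairwise comparisons instead of the O(N^2) of full pairwise evaluation. The quality-score judge and the seeding-by-order convention are placeholders for ArenaRL's process-aware, rubric-based pairwise evaluator.</p>

            <pre><code># Seeded single-elimination ranking with O(N) pairwise comparisons.
import random

def toy_pairwise_judge(a, b):
    """Stand-in judge: returns the preferred trajectory of the pair."""
    return a if a["quality"] >= b["quality"] else b

def single_elimination_ranks(trajectories, judge=toy_pairwise_judge):
    """Assign a rank to every trajectory from a seeded knockout bracket.

    Losers in earlier rounds receive worse ranks; the last survivor is rank 0.
    Total comparisons are N - 1, i.e. O(N) rather than O(N^2) full pairwise.
    """
    alive = list(trajectories)          # seeding = initial ordering
    ranks = {}
    next_rank = len(trajectories) - 1   # worst rank handed out first
    while len(alive) > 1:
        winners = []
        # Pair adjacent seeds; an odd entrant gets a bye to the next round.
        for i in range(0, len(alive) - 1, 2):
            winner = judge(alive[i], alive[i + 1])
            loser = alive[i + 1] if winner is alive[i] else alive[i]
            ranks[loser["id"]] = next_rank
            next_rank -= 1
            winners.append(winner)
        if len(alive) % 2 == 1:
            winners.append(alive[-1])
        alive = winners
    ranks[alive[0]["id"]] = 0
    return ranks

if __name__ == "__main__":
    group = [{"id": i, "quality": random.random()} for i in range(8)]
    print(single_elimination_ranks(group))
</code></pre>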
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:55:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9031021c/eaba57c2.mp3" length="25031616" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1561</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiang Zhang, Boli Chen, Fanrui Zhang, Ruixue Ding, Shihang Wang, Qiuchen Wang, Yinfeng Huang, Haonan Zhang, Rongxiang Zhu, Pengyong Wang, Ailin Ren, Xin Li, Pengjun Xie, Jiawei Liu, Ning Guo, Jingren Zhou, Zheng-Jun Zha</p>

            <p><strong>Title:</strong><br>
            ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06487v1">http://arxiv.org/abs/2601.06487v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. ArenaRL introduces a process-aware pairwise evaluation mechanism, employing multi-level rubrics to assign fine-grained relative scores to trajectories. Additionally, we construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Empirical results confirm that the proposed seeded single-elimination scheme achieves advantage estimation accuracy nearly equivalent to full O(N^2) pairwise comparison while operating with only O(N) complexity, striking an optimal balance between efficiency and precision. Furthermore, to address the lack of full-cycle benchmarks for open-ended agents, we build Open-Travel and Open-DeepResearch, two high-quality benchmarks featuring a comprehensive pipeline covering SFT, RL training, and multi-dimensional evaluation. Extensive experiments show that ArenaRL substantially outperforms standard RL baselines, enabling LLM agents to generate more robust solutions for complex real-world tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MemoBrain: Executive Memory as an Agentic Brain for Reasoning</title>
      <itunes:episode>1592</itunes:episode>
      <podcast:episode>1592</podcast:episode>
      <itunes:title>MemoBrain: Executive Memory as an Agentic Brain for Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ebb2f43a-0392-4a3c-9f57-c0c428eb5347</guid>
      <link>https://share.transistor.fm/s/8f4b0884</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hongjin Qian, Zhao Cao, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            MemoBrain: Executive Memory as an Agentic Brain for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08079v1">http://arxiv.org/abs/2601.08079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hongjin Qian, Zhao Cao, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            MemoBrain: Executive Memory as an Agentic Brain for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08079v1">http://arxiv.org/abs/2601.08079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.</p>
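
            <p><strong>Illustrative sketch (editor's addition, not MemoBrain itself):</strong><br>
            The prune/fold/preserve behavior described above can be pictured with the minimal bookkeeper below: invalid steps are dropped, finished sub-trajectories are folded into one summary entry, and the most salient remaining steps are kept under a fixed token budget. Dependency tracking and the actual salience scoring are omitted; all names and heuristics here are assumptions.</p>

            <pre><code># Toy executive-memory compaction under a fixed context budget.
from dataclasses import dataclass

@dataclass
class Step:
    step_id: int
    text: str
    salience: float
    valid: bool = True      # False for failed or superseded steps
    done: bool = False      # True when the step's sub-goal is fully resolved

def token_cost(text):
    return len(text.split())           # crude proxy for a token count

def compact_memory(steps, budget):
    """Return a compact backbone of steps that fits the token budget."""
    # 1) Prune steps marked invalid (e.g. failed tool calls).
    kept = [s for s in steps if s.valid]
    # 2) Fold completed sub-trajectories into a single summary step.
    folded, finished = [], [s for s in kept if s.done]
    if finished:
        summary = "; ".join(s.text for s in finished)
        folded.append(Step(-1, f"[folded] {summary}", salience=1.0, done=True))
    folded.extend(s for s in kept if not s.done)
    # 3) Keep the most salient steps that still fit the budget.
    folded.sort(key=lambda s: s.salience, reverse=True)
    backbone, used = [], 0
    for s in folded:
        cost = token_cost(s.text)
        if used + cost <= budget:
            backbone.append(s)
            used += cost
    return sorted(backbone, key=lambda s: s.step_id)

if __name__ == "__main__":
    trace = [
        Step(0, "search docs for API limits", 0.9, done=True),
        Step(1, "tool call failed with timeout", 0.2, valid=False),
        Step(2, "extract rate limits from result", 0.8, done=True),
        Step(3, "draft final answer using limits", 0.95),
    ]
    for s in compact_memory(trace, budget=20):
        print(s.step_id, s.text)
</code></pre>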
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:55:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f4b0884/7d382de6.mp3" length="21612273" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hongjin Qian, Zhao Cao, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            MemoBrain: Executive Memory as an Agentic Brain for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08079v1">http://arxiv.org/abs/2601.08079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons. We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation. We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Motion Attribution for Video Generation</title>
      <itunes:episode>1591</itunes:episode>
      <podcast:episode>1591</podcast:episode>
      <itunes:title>Motion Attribution for Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">35da82fe-3566-4368-82fd-09e8c11349ec</guid>
      <link>https://share.transistor.fm/s/32392001</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine</p>

            <p><strong>Title:</strong><br>
            Motion Attribution for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08828v1">http://arxiv.org/abs/2601.08828v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine</p>

            <p><strong>Title:</strong><br>
            Motion Attribution for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08828v1">http://arxiv.org/abs/2601.08828v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.</p>
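
            <p><strong>Illustrative sketch (editor's addition, not the paper's exact formulation):</strong><br>
            As a hedged illustration of a motion-weighted loss mask, the numpy sketch below derives per-pixel weights from frame-to-frame change and uses them to reweight a reconstruction loss, so that gradients (and hence influence estimates) emphasize temporal dynamics over static appearance. The shapes, normalization, and the plain MSE stand-in for the diffusion training loss are our assumptions.</p>

            <pre><code># Toy motion-weighted reconstruction loss for video tensors.
import numpy as np

def motion_weight_mask(video, eps=1e-6):
    """video: (T, H, W, C) in [0, 1]. Returns per-frame weights of shape (T, H, W)."""
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=-1)   # (T-1, H, W) frame change
    diffs = np.concatenate([diffs[:1], diffs], axis=0)     # pad the first frame
    return diffs / (diffs.max() + eps)                     # normalize to [0, 1]

def motion_weighted_mse(pred, target):
    """MSE where moving regions dominate the loss (static pixels are down-weighted)."""
    w = motion_weight_mask(target)[..., None]              # (T, H, W, 1)
    return float((w * (pred - target) ** 2).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.random((8, 16, 16, 3))
    pred = target + 0.05 * rng.standard_normal(target.shape)
    print("motion-weighted loss:", motion_weighted_mse(pred, target))
</code></pre>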
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:54:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/32392001/7579fae4.mp3" length="19362795" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1206</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine</p>

            <p><strong>Title:</strong><br>
            Motion Attribution for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08828v1">http://arxiv.org/abs/2601.08828v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>3AM: Segment Anything with Geometric Consistency in Videos</title>
      <itunes:episode>1590</itunes:episode>
      <podcast:episode>1590</podcast:episode>
      <itunes:title>3AM: Segment Anything with Geometric Consistency in Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b8d50cb2-078b-4eb4-bb57-f33b3efe8292</guid>
      <link>https://share.transistor.fm/s/95f295b7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            3AM: Segment Anything with Geometric Consistency in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08831v1">http://arxiv.org/abs/2601.08831v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            3AM: Segment Anything with Geometric Consistency in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08831v1">http://arxiv.org/abs/2601.08831v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 Jan 2026 19:54:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95f295b7/94b19c69.mp3" length="21782798" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            3AM: Segment Anything with Geometric Consistency in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.08831v1">http://arxiv.org/abs/2601.08831v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BabyVision: Visual Reasoning Beyond Language</title>
      <itunes:episode>1589</itunes:episode>
      <podcast:episode>1589</podcast:episode>
      <itunes:title>BabyVision: Visual Reasoning Beyond Language</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8bd8ded-6cca-4326-b54c-e431cb9406a7</guid>
      <link>https://share.transistor.fm/s/699d7cbe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li</p>

            <p><strong>Title:</strong><br>
            BabyVision: Visual Reasoning Beyond Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06521v1">http://arxiv.org/abs/2601.06521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess MLLMs' core visual abilities independently of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li</p>

            <p><strong>Title:</strong><br>
            BabyVision: Visual Reasoning Beyond Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06521v1">http://arxiv.org/abs/2601.06521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess MLLMs' core visual abilities independently of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:51:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/699d7cbe/3c46e727.mp3" length="21241109" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1324</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 156 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Y. charles, Yiping Bao, Yuantao Fan, Guopeng Li, Haiyang Shen, Xuanzhong Chen, Wendong Xu, Shuzheng Si, Zefan Cai, Wenhao Chai, Ziqi Huang, Fangfu Liu, Tianyu Liu, Baobao Chang, Xiaobo Hu, Kaiyuan Chen, Yixin Ren, Yang Liu, Yuan Gong, Kuan Li</p>

            <p><strong>Title:</strong><br>
            BabyVision: Visual Reasoning Beyond Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06521v1">http://arxiv.org/abs/2601.06521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While humans develop core visual skills long before acquiring language, contemporary Multimodal LLMs (MLLMs) still rely heavily on linguistic priors to compensate for their fragile visual understanding. We uncovered a crucial fact: state-of-the-art MLLMs consistently fail on basic visual tasks that humans, even 3-year-olds, can solve effortlessly. To systematically investigate this gap, we introduce BabyVision, a benchmark designed to assess MLLMs' core visual abilities independently of linguistic knowledge. BabyVision spans a wide range of tasks, with 388 items divided into 22 subclasses across four key categories. Empirical results and human evaluation reveal that leading MLLMs perform significantly below human baselines. Gemini3-Pro-Preview scores 49.7, lagging behind 6-year-old humans and falling well behind the average adult score of 94.1. These results show that, despite excelling in knowledge-heavy evaluations, current MLLMs still lack fundamental visual primitives. Progress in BabyVision represents a step toward human-level visual perception and reasoning capabilities. We also explore solving visual reasoning with generation models by proposing BabyVision-Gen and an automatic evaluation toolkit. Our code and benchmark data are released at https://github.com/UniPat-AI/BabyVision for reproduction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning</title>
      <itunes:episode>1588</itunes:episode>
      <podcast:episode>1588</podcast:episode>
      <itunes:title>PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e356cfb8-b0b0-4153-aaef-4c25594ae6f0</guid>
      <link>https://share.transistor.fm/s/b2b6b23b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum</p>

            <p><strong>Title:</strong><br>
            PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05593v1">http://arxiv.org/abs/2601.05593v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum</p>

            <p><strong>Title:</strong><br>
            PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05593v1">http://arxiv.org/abs/2601.05593v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.</p>
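
            <p><strong>Illustrative sketch (editor's addition, purely schematic):</strong><br>
            The round structure of parallel exploration plus message passing can be sketched as the small orchestration loop below: each round launches several trajectories in parallel, compacts each one into a bounded message, and feeds the messages forward as hints for the next round. The solve_once worker, the character budget, and the round/width settings are placeholder assumptions; in practice the worker and final synthesis would be reasoning-model calls.</p>

            <pre><code># Schematic multi-round, parallel reasoning loop with bounded messages.
from concurrent.futures import ThreadPoolExecutor

MSG_CHAR_BUDGET = 400   # stand-in for a token budget on each message

def solve_once(question, hints, worker_id):
    """Placeholder for one reasoning trajectory; returns a raw trace."""
    return f"worker {worker_id}: partial reasoning about '{question}' given {len(hints)} hints"

def compact(trace):
    """Compress a trajectory into a bounded message (here: simple truncation)."""
    return trace[:MSG_CHAR_BUDGET]

def coordinated_reasoning(question, rounds=3, width=4):
    hints = []
    for _ in range(rounds):
        with ThreadPoolExecutor(max_workers=width) as pool:
            traces = list(pool.map(
                lambda i: solve_once(question, hints, i), range(width)))
        hints = [compact(t) for t in traces]   # messages seeding the next round
    # Final synthesis: a model would merge the messages into one answer;
    # here we simply return the last round's messages.
    return hints

if __name__ == "__main__":
    print(coordinated_reasoning("What is 2 + 2?", rounds=2, width=3))
</code></pre>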
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:51:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2b6b23b/15014345.mp3" length="22413937" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum</p>

            <p><strong>Title:</strong><br>
            PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05593v1">http://arxiv.org/abs/2601.05593v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5's 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head</title>
      <itunes:episode>1587</itunes:episode>
      <podcast:episode>1587</podcast:episode>
      <itunes:title>MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">94e3abfb-c357-4722-b182-5dee1df0157a</guid>
      <link>https://share.transistor.fm/s/6dd1b1ec</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07832v1">http://arxiv.org/abs/2601.07832v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07832v1">http://arxiv.org/abs/2601.07832v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.</p>
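
            <p><strong>Illustrative sketch (editor's addition, not MHLA's exact design):</strong><br>
            To give a concrete picture of linear attention with heads along the token dimension, the numpy sketch below splits the sequence into contiguous token groups and runs standard linear attention (ELU + 1 feature map) inside each group, keeping cost linear in sequence length. The contiguous grouping and the feature map are common defaults chosen for illustration only.</p>

            <pre><code># Token-level grouped linear attention (illustrative defaults).
import numpy as np

def feature_map(x):
    """ELU(x) + 1, a standard positive feature map for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, eps=1e-6):
    """Standard linear attention over one token group: (n, d) inputs."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                          # (d, d_v) summarized key-value state
    z = qf @ kf.sum(axis=0)                # (n,) normalizer
    return (qf @ kv) / (z[:, None] + eps)

def token_level_multihead(q, k, v, n_groups):
    """Split the sequence into contiguous token groups and attend per group."""
    n = q.shape[0]
    bounds = np.linspace(0, n, n_groups + 1, dtype=int)
    out = np.zeros_like(v)
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e > s:
            out[s:e] = linear_attention(q[s:e], k[s:e], v[s:e])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 64, 32
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    print(token_level_multihead(q, k, v, n_groups=4).shape)   # (64, 32)
</code></pre>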
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:50:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6dd1b1ec/23b185fa.mp3" length="21589300" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1346</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07832v1">http://arxiv.org/abs/2601.07832v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests</title>
      <itunes:episode>1586</itunes:episode>
      <podcast:episode>1586</podcast:episode>
      <itunes:title>X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">36a89a88-c12f-405e-87c8-6cde300f6a59</guid>
      <link>https://share.transistor.fm/s/7e4f49e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06953v1">http://arxiv.org/abs/2601.06953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06953v1">http://arxiv.org/abs/2601.06953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:50:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e4f49e6/afda1fde.mp3" length="21652428" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06953v1">http://arxiv.org/abs/2601.06953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts</title>
      <itunes:episode>1585</itunes:episode>
      <podcast:episode>1585</podcast:episode>
      <itunes:title>GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">96d01838-95fc-4aae-9762-574b4e52a461</guid>
      <link>https://share.transistor.fm/s/fab254d2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05110v1">http://arxiv.org/abs/2601.05110v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05110v1">http://arxiv.org/abs/2601.05110v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:50:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fab254d2/65382ff0.mp3" length="19373286" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05110v1">http://arxiv.org/abs/2601.05110v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Lost in the Noise: How Reasoning Models Fail with Contextual Distractors</title>
      <itunes:episode>1584</itunes:episode>
      <podcast:episode>1584</podcast:episode>
      <itunes:title>Lost in the Noise: How Reasoning Models Fail with Contextual Distractors</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ae8003c-a615-4378-95d9-a7342ec36c19</guid>
      <link>https://share.transistor.fm/s/50c27106</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo</p>

            <p><strong>Title:</strong><br>
            Lost in the Noise: How Reasoning Models Fail with Contextual Distractors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07226v1">http://arxiv.org/abs/2601.07226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and that distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward-only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo</p>

            <p><strong>Title:</strong><br>
            Lost in the Noise: How Reasoning Models Fail with Contextual Distractors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07226v1">http://arxiv.org/abs/2601.07226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and that distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward-only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:49:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/50c27106/f11f96d0.mp3" length="22705248" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo</p>

            <p><strong>Title:</strong><br>
            Lost in the Noise: How Reasoning Models Fail with Contextual Distractors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07226v1">http://arxiv.org/abs/2601.07226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and that distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward-only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent</title>
      <itunes:episode>1583</itunes:episode>
      <podcast:episode>1583</podcast:episode>
      <itunes:title>OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8b190cd9-e23a-419c-925a-eabf9384bbc3</guid>
      <link>https://share.transistor.fm/s/a5f19898</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.MA, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding</p>

            <p><strong>Title:</strong><br>
            OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07779v1">http://arxiv.org/abs/2601.07779v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.MA, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding</p>

            <p><strong>Title:</strong><br>
            OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07779v1">http://arxiv.org/abs/2601.07779v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 Jan 2026 19:49:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5f19898/388d99cd.mp3" length="23562908" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1469</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.MA, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding</p>

            <p><strong>Title:</strong><br>
            OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.07779v1">http://arxiv.org/abs/2601.07779v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization</title>
      <itunes:episode>1582</itunes:episode>
      <podcast:episode>1582</podcast:episode>
      <itunes:title>Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d3f73d9e-d74e-4fbe-ae53-46f6dfa5177d</guid>
      <link>https://share.transistor.fm/s/2a92ff0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 131 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05432v1">http://arxiv.org/abs/2601.05432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the <em>Thinking with Map</em> ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to <em>Gemini-3-Pro</em> with Google Search/Map grounded mode.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 131 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05432v1">http://arxiv.org/abs/2601.05432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the <em>Thinking with Map</em> ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to <em>Gemini-3-Pro</em> with Google Search/Map grounded mode.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:36:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2a92ff0d/7a47cf5e.mp3" length="25905985" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1615</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 131 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05432v1">http://arxiv.org/abs/2601.05432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the <em>Thinking with Map</em> ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, consisting of agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to <em>Gemini-3-Pro</em> with Google Search/Map grounded mode.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMFormalizer: Multimodal Autoformalization in the Wild</title>
      <itunes:episode>1581</itunes:episode>
      <podcast:episode>1581</podcast:episode>
      <itunes:title>MMFormalizer: Multimodal Autoformalization in the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">65f52d67-e324-4f6b-a0a7-7d8e174dc8a0</guid>
      <link>https://share.transistor.fm/s/16f71ce1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            MMFormalizer: Multimodal Autoformalization in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03017v1">http://arxiv.org/abs/2601.03017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            MMFormalizer: Multimodal Autoformalization in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03017v1">http://arxiv.org/abs/2601.03017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:36:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16f71ce1/f44b5a83.mp3" length="21144988" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Qi Han, Yunta Hsieh, Hui Shen, Huajian Xin, Chaofan Tao, Chenyang Zhao, Hengyuan Zhang, Taiqiang Wu, Zhen Zhang, Haochen Wang, Zhongwei Wan, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            MMFormalizer: Multimodal Autoformalization in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03017v1">http://arxiv.org/abs/2601.03017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoformalization, which translates natural language mathematics into formal statements to enable machine reasoning, faces fundamental challenges in the wild due to the multimodal nature of the physical world, where physics requires inferring hidden constraints (e.g., mass or energy) from visual elements. To address this, we propose MMFormalizer, which extends autoformalization beyond text by integrating adaptive grounding with entities from real-world mathematical and physical domains. MMFormalizer recursively constructs formal propositions from perceptually grounded primitives through recursive grounding and axiom composition, with adaptive recursive termination ensuring that every abstraction is supported by visual evidence and anchored in dimensional or axiomatic grounding. We evaluate MMFormalizer on a new benchmark, PhyX-AF, comprising 115 curated samples from MathVerse, PhyX, Synthetic Geometry, and Analytic Geometry, covering diverse multimodal autoformalization tasks. Results show that frontier models such as GPT-5 and Gemini-3-Pro achieve the highest compile and semantic accuracy, with GPT-5 excelling in physical reasoning, while geometry remains the most challenging domain. Overall, MMFormalizer provides a scalable framework for unified multimodal autoformalization, bridging perception and formal reasoning. To the best of our knowledge, this is the first multimodal autoformalization method capable of handling classical mechanics (derived from the Hamiltonian), as well as relativity, quantum mechanics, and thermodynamics. More details are available on our project page: MMFormalizer.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature</title>
      <itunes:episode>1580</itunes:episode>
      <podcast:episode>1580</podcast:episode>
      <itunes:title>CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">151498b6-c58c-4836-9fce-35b6bcfc19e2</guid>
      <link>https://share.transistor.fm/s/819e1254</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.GR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Eldad Matmon, Amit Bracha, Noam Rotstein, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03319v1">http://arxiv.org/abs/2601.03319v1</a></p>

            <p><strong>Abstract:</strong><br>
            A photorealistic and controllable 3D caricaturization framework for faces is introduced. We start with an intrinsic Gaussian curvature-based surface exaggeration technique, which, when coupled with texture, tends to produce over-smoothed renders. To address this, we resort to 3D Gaussian Splatting (3DGS), which has recently been shown to produce realistic free-viewpoint avatars. Given a multiview sequence, we extract a FLAME mesh, solve a curvature-weighted Poisson equation, and obtain its exaggerated form. However, directly deforming the Gaussians yields poor results, necessitating the synthesis of pseudo-ground-truth caricature images by warping each frame to its exaggerated 2D representation using local affine transformations. We then devise a training scheme that alternates real and synthesized supervision, enabling a single Gaussian collection to represent both natural and exaggerated avatars. This scheme improves fidelity, supports local edits, and allows continuous control over the intensity of the caricature. In order to achieve real-time deformations, an efficient interpolation between the original and exaggerated surfaces is introduced. We further analyze and show that it has a bounded deviation from closed-form solutions. In both quantitative and qualitative evaluations, our results outperform prior work, delivering photorealistic, geometry-controlled caricature avatars.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.GR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Eldad Matmon, Amit Bracha, Noam Rotstein, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03319v1">http://arxiv.org/abs/2601.03319v1</a></p>

            <p><strong>Abstract:</strong><br>
            A photorealistic and controllable 3D caricaturization framework for faces is introduced. We start with an intrinsic Gaussian curvature-based surface exaggeration technique, which, when coupled with texture, tends to produce over-smoothed renders. To address this, we resort to 3D Gaussian Splatting (3DGS), which has recently been shown to produce realistic free-viewpoint avatars. Given a multiview sequence, we extract a FLAME mesh, solve a curvature-weighted Poisson equation, and obtain its exaggerated form. However, directly deforming the Gaussians yields poor results, necessitating the synthesis of pseudo-ground-truth caricature images by warping each frame to its exaggerated 2D representation using local affine transformations. We then devise a training scheme that alternates real and synthesized supervision, enabling a single Gaussian collection to represent both natural and exaggerated avatars. This scheme improves fidelity, supports local edits, and allows continuous control over the intensity of the caricature. In order to achieve real-time deformations, an efficient interpolation between the original and exaggerated surfaces is introduced. We further analyze and show that it has a bounded deviation from closed-form solutions. In both quantitative and qualitative evaluations, our results outperform prior work, delivering photorealistic, geometry-controlled caricature avatars.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:35:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/819e1254/de2dfc89.mp3" length="21057241" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.GR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Eldad Matmon, Amit Bracha, Noam Rotstein, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            CaricatureGS: Exaggerating 3D Gaussian Splatting Faces With Gaussian Curvature</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03319v1">http://arxiv.org/abs/2601.03319v1</a></p>

            <p><strong>Abstract:</strong><br>
            A photorealistic and controllable 3D caricaturization framework for faces is introduced. We start with an intrinsic Gaussian curvature-based surface exaggeration technique, which, when coupled with texture, tends to produce over-smoothed renders. To address this, we resort to 3D Gaussian Splatting (3DGS), which has recently been shown to produce realistic free-viewpoint avatars. Given a multiview sequence, we extract a FLAME mesh, solve a curvature-weighted Poisson equation, and obtain its exaggerated form. However, directly deforming the Gaussians yields poor results, necessitating the synthesis of pseudo-ground-truth caricature images by warping each frame to its exaggerated 2D representation using local affine transformations. We then devise a training scheme that alternates real and synthesized supervision, enabling a single Gaussian collection to represent both natural and exaggerated avatars. This scheme improves fidelity, supports local edits, and allows continuous control over the intensity of the caricature. In order to achieve real-time deformations, an efficient interpolation between the original and exaggerated surfaces is introduced. We further analyze and show that it has a bounded deviation from closed-form solutions. In both quantitative and qualitative evaluations, our results outperform prior work, delivering photorealistic, geometry-controlled caricature avatars.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning</title>
      <itunes:episode>1579</itunes:episode>
      <podcast:episode>1579</podcast:episode>
      <itunes:title>The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d12615fc-a0f7-46f0-bf95-62e90649acb5</guid>
      <link>https://share.transistor.fm/s/b46df43c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06002v1">http://arxiv.org/abs/2601.06002v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from imitation of humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06002v1">http://arxiv.org/abs/2601.06002v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from imitation of humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:35:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b46df43c/1b07ce49.mp3" length="22092957" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06002v1">http://arxiv.org/abs/2601.06002v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from imitation of humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards</title>
      <itunes:episode>1578</itunes:episode>
      <podcast:episode>1578</podcast:episode>
      <itunes:title>Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9743eda6-c78d-41a4-99b7-f8a14e169c69</guid>
      <link>https://share.transistor.fm/s/7e410a7f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06021v1">http://arxiv.org/abs/2601.06021v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose <strong>Citation-aware Rubric Rewards (CaRR)</strong>, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce <strong>Citation-aware Group Relative Policy Optimization (C-GRPO)</strong>, which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06021v1">http://arxiv.org/abs/2601.06021v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose <strong>Citation-aware Rubric Rewards (CaRR)</strong>, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce <strong>Citation-aware Group Relative Policy Optimization (C-GRPO)</strong>, which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:35:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e410a7f/b0442e4f.mp3" length="24680979" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1539</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.06021v1">http://arxiv.org/abs/2601.06021v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose <strong>Citation-aware Rubric Rewards (CaRR)</strong>, a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce <strong>Citation-aware Group Relative Policy Optimization (C-GRPO)</strong>, which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis</title>
      <itunes:episode>1577</itunes:episode>
      <podcast:episode>1577</podcast:episode>
      <itunes:title>EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f53fe704-0465-4f8e-93d5-7fd3275d4ef9</guid>
      <link>https://share.transistor.fm/s/ea27b8d9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05808v1">http://arxiv.org/abs/2601.05808v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.</p>
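
            <p>As a concrete illustration of what a rule-based trajectory validation function for one synthesized scenario could look like (the scenario, tool names, and rules below are invented for illustration; they are not EnvScaler's actual outputs):</p>

            <pre><code># Hypothetical validator for a synthesized "refund" scenario: the trajectory
# passes only if the required tools were called in a valid order and the
# final environment state satisfies the scenario's goal condition.
def validate_refund_trajectory(trajectory, final_state):
    calls = [step["tool"] for step in trajectory]
    if "lookup_order" not in calls or "issue_refund" not in calls:
        return False
    # issue_refund must come after lookup_order.
    first_refund = calls.index("issue_refund")
    if "lookup_order" not in calls[:first_refund]:
        return False
    return final_state.get("order_status") == "refunded"

traj = [{"tool": "lookup_order"}, {"tool": "issue_refund"}]
print(validate_refund_trajectory(traj, {"order_status": "refunded"}))  # True
</code></pre>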
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05808v1">http://arxiv.org/abs/2601.05808v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:34:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ea27b8d9/c26c578e.mp3" length="22517183" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05808v1">http://arxiv.org/abs/2601.05808v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking</title>
      <itunes:episode>1576</itunes:episode>
      <podcast:episode>1576</podcast:episode>
      <itunes:title>Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be20f656-735a-44db-9a63-586eedf4bcbe</guid>
      <link>https://share.transistor.fm/s/df375173</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04720v1">http://arxiv.org/abs/2601.04720v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in <strong>2B</strong> and <strong>8B</strong> parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of <strong>77.8</strong> on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.</p>
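
            <p>As background on the flexible-dimension idea mentioned above, Matryoshka-style embeddings let you keep only a prefix of the full vector and re-normalize it before cosine scoring; a generic sketch with stand-in vectors (not the released models' code):</p>

            <pre><code>import numpy as np

def truncate_and_normalize(emb, dim):
    """Keep the first `dim` coordinates and L2-normalize, as in Matryoshka-style
    representations where shorter prefixes remain usable embeddings."""
    prefix = np.asarray(emb, dtype=np.float32)[:dim]
    return prefix / (np.linalg.norm(prefix) + 1e-12)

def cosine_score(query_emb, doc_emb, dim=256):
    q = truncate_and_normalize(query_emb, dim)
    d = truncate_and_normalize(doc_emb, dim)
    return float(q @ d)

# Stand-in vectors; a real multimodal encoder would produce these.
rng = np.random.default_rng(0)
query, doc = rng.normal(size=1024), rng.normal(size=1024)
print(cosine_score(query, doc, dim=256))
</code></pre>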
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04720v1">http://arxiv.org/abs/2601.04720v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in <strong>2B</strong> and <strong>8B</strong> parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of <strong>77.8</strong> on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 Jan 2026 19:34:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/df375173/e5d66af2.mp3" length="21201892" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04720v1">http://arxiv.org/abs/2601.04720v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in <strong>2B</strong> and <strong>8B</strong> parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of <strong>77.8</strong> on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</title>
      <itunes:episode>1575</itunes:episode>
      <podcast:episode>1575</podcast:episode>
      <itunes:title>GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b53183c1-c28e-405f-9f0c-425a088fa73c</guid>
      <link>https://share.transistor.fm/s/332ed783</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05242v1">http://arxiv.org/abs/2601.05242v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.</p>
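
            <p>A toy numeric sketch of the contrast the abstract draws, assuming two rule-based rewards per rollout and group-relative (mean/std) normalization; this illustrates the idea only and is not the paper's exact formulation:</p>

            <pre><code>import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: sum the rewards per rollout, then normalize within the group."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantages(rewards):
    """GDPO-style idea: normalize each reward across the group, then combine."""
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

# Two rewards (e.g. correctness, format) for a group of 4 rollouts.
rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0],
                    [1.0, 0.0]])
print(grpo_advantages(rewards))       # rollout 1's distinct mix collapses onto rollouts 0 and 3
print(decoupled_advantages(rewards))  # rollout 1 keeps a different advantage
</code></pre>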
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05242v1">http://arxiv.org/abs/2601.05242v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 Jan 2026 19:13:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/332ed783/f908d3bf.mp3" length="24122152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05242v1">http://arxiv.org/abs/2601.05242v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers</title>
      <itunes:episode>1574</itunes:episode>
      <podcast:episode>1574</podcast:episode>
      <itunes:title>Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f6b0bbd-827a-477b-a690-8c85582fb117</guid>
      <link>https://share.transistor.fm/s/a83a9cec</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid</p>

            <p><strong>Title:</strong><br>
            Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04890v1">http://arxiv.org/abs/2601.04890v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they show improvements in downstream evaluations that match the gains from switching from Adam to Muon.</p>
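
            <p>A minimal sketch (in PyTorch, assuming a plain linear layer and a multiplier initialized to 1) of the first step described above, attaching a learnable scalar multiplier to a weight matrix so its overall scale is learned rather than pinned by the weight-decay equilibrium:</p>

            <pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledLinear(nn.Module):
    """Linear layer whose effective weight is scale * W, with scale a learnable scalar."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.scale = nn.Parameter(torch.ones(()))  # learnable scalar multiplier

    def forward(self, x):
        # Weight decay on self.weight constrains W itself; the learned scale
        # frees the overall magnitude of the layer.
        return F.linear(x, self.scale * self.weight)

layer = ScaledLinear(16, 32)
out = layer(torch.randn(4, 16))
print(out.shape, layer.scale.item())  # torch.Size([4, 32]) 1.0
</code></pre>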
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid</p>

            <p><strong>Title:</strong><br>
            Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04890v1">http://arxiv.org/abs/2601.04890v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they show improvements in downstream evaluations that match the gains from switching from Adam to Muon.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 Jan 2026 19:13:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a83a9cec/f7278a00.mp3" length="24153894" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1506</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid</p>

            <p><strong>Title:</strong><br>
            Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.04890v1">http://arxiv.org/abs/2601.04890v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they show improvements in downstream evaluations that match the gains from switching from Adam to Muon.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes</title>
      <itunes:episode>1573</itunes:episode>
      <podcast:episode>1573</podcast:episode>
      <itunes:title>RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">51fb5592-203e-4752-9bc7-ec9b8132aa25</guid>
      <link>https://share.transistor.fm/s/c554a28c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05249v1">http://arxiv.org/abs/2601.05249v1</a></p>

            <p><strong>Abstract:</strong><br>
            Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/</p>
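
            <p>For readers unfamiliar with the underlying operation, auto white balance ultimately applies per-channel gains derived from an estimated illuminant; a generic sketch of that correction step (not the paper's RL method, which concerns how the estimate itself is produced and tuned):</p>

            <pre><code>import numpy as np

def apply_white_balance(image, illuminant):
    """Scale each RGB channel by a gain derived from the estimated illuminant,
    anchored so that the green-channel gain is 1 (a common AWB convention)."""
    illum = np.asarray(illuminant, dtype=np.float32)
    gains = illum[1] / illum            # per-channel gains
    corrected = image.astype(np.float32) * gains
    return np.clip(corrected, 0.0, 1.0)

# Example: a warm (reddish) illuminant estimated on a night scene.
img = np.random.rand(8, 8, 3).astype(np.float32)
print(apply_white_balance(img, illuminant=[0.8, 0.6, 0.4]).shape)  # (8, 8, 3)
</code></pre>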
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05249v1">http://arxiv.org/abs/2601.05249v1</a></p>

            <p><strong>Abstract:</strong><br>
            Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 Jan 2026 19:12:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c554a28c/ebbe4d06.mp3" length="22281047" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1389</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05249v1">http://arxiv.org/abs/2601.05249v1</a></p>

            <p><strong>Abstract:</strong><br>
            Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Token-Level LLM Collaboration via FusionRoute</title>
      <itunes:episode>1572</itunes:episode>
      <podcast:episode>1572</podcast:episode>
      <itunes:title>Token-Level LLM Collaboration via FusionRoute</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e5f5ca77-e3a4-49b8-b7ca-dcaeb25e0810</guid>
      <link>https://share.transistor.fm/s/748290b6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao</p>

            <p><strong>Title:</strong><br>
            Token-Level LLM Collaboration via FusionRoute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05106v1">http://arxiv.org/abs/2601.05106v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.</p>
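
            <p>A toy sketch of the token-level fusion step described above: at each decoding step a router picks one expert and adds its own complementary logit vector to that expert's next-token logits (shapes and numbers are illustrative only):</p>

            <pre><code>import numpy as np

def fused_next_token_logits(expert_logits, router_choice, router_logits):
    """expert_logits: (num_experts, vocab) logits from the frozen experts;
    router_choice: index of the expert selected at this step;
    router_logits: (vocab,) complementary correction from the router."""
    return expert_logits[router_choice] + router_logits

rng = np.random.default_rng(0)
expert_logits = rng.normal(size=(3, 10))    # 3 experts, toy vocabulary of 10
router_logits = 0.1 * rng.normal(size=10)   # small corrective contribution
fused = fused_next_token_logits(expert_logits, router_choice=1, router_logits=router_logits)
print(int(np.argmax(fused)))                # fused next-token choice
</code></pre>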
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao</p>

            <p><strong>Title:</strong><br>
            Token-Level LLM Collaboration via FusionRoute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05106v1">http://arxiv.org/abs/2601.05106v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 Jan 2026 19:12:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/748290b6/4ec67818.mp3" length="24795435" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1546</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao</p>

            <p><strong>Title:</strong><br>
            Token-Level LLM Collaboration via FusionRoute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.05106v1">http://arxiv.org/abs/2601.05106v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</title>
      <itunes:episode>1571</itunes:episode>
      <podcast:episode>1571</podcast:episode>
      <itunes:title>Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9aa77359-9dfc-45cf-b1a0-07558b44e3e1</guid>
      <link>https://share.transistor.fm/s/ba0be23e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma</p>

            <p><strong>Title:</strong><br>
            Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02151v1">http://arxiv.org/abs/2601.02151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts": tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.</p>
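
            <p>A toy sketch of the entropy-gating idea: down-weight the loss on tokens where the model is confident (low entropy) yet assigns low probability to the label, i.e. the "Confident Conflicts" above (the thresholds and the hard 0/1 gate are illustrative choices, not the paper's exact scheme):</p>

            <pre><code>import numpy as np

def entropy_gated_weights(probs, target_ids, entropy_threshold=1.0, prob_threshold=0.2):
    """probs: (seq_len, vocab) model probabilities; target_ids: (seq_len,) gold labels.
    Returns per-token loss weights that suppress 'confident conflict' tokens:
    low entropy (the model is sure of itself) combined with low target probability."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    target_prob = probs[np.arange(len(target_ids)), target_ids]
    conflict = (entropy < entropy_threshold) & (target_prob < prob_threshold)
    return np.where(conflict, 0.0, 1.0)   # zero out gradients on conflicting tokens

# Toy example: token 0 is a confident conflict, token 1 is genuinely uncertain.
probs = np.array([[0.97, 0.01, 0.02],    # low entropy, target only gets 0.01
                  [0.40, 0.35, 0.25]])   # high entropy, keep its gradient
print(entropy_gated_weights(probs, target_ids=np.array([1, 1])))  # [0. 1.]
</code></pre>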
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma</p>

            <p><strong>Title:</strong><br>
            Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02151v1">http://arxiv.org/abs/2601.02151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts": tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 Jan 2026 19:17:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ba0be23e/7da5a3f4.mp3" length="22571511" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Muxi Diao, Lele Yang, Wuxuan Gong, Yutong Zhang, Zhonghao Yan, Yufei Han, Kongming Liang, Weiran Xu, Zhanyu Ma</p>

            <p><strong>Title:</strong><br>
            Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02151v1">http://arxiv.org/abs/2601.02151v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. In sharp contrast, on-policy Reinforcement Learning (RL) effectively preserves general capabilities. We investigate this discrepancy and identify a fundamental distributional gap: while RL aligns with the model's internal belief, SFT forces the model to fit external supervision. This mismatch often manifests as "Confident Conflicts": tokens characterized by low probability but low entropy. In these instances, the model is highly confident in its own prediction but is forced to learn a divergent ground truth, triggering destructive gradient updates. To address this, we propose Entropy-Adaptive Fine-Tuning (EAFT). Unlike methods relying solely on prediction probability, EAFT utilizes token-level entropy as a gating mechanism to distinguish between epistemic uncertainty and knowledge conflict. This allows the model to learn from uncertain samples while suppressing gradients on conflicting data. Extensive experiments on Qwen and GLM series (ranging from 4B to 32B parameters) across mathematical, medical, and agentic domains confirm our hypothesis. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Evolving Programmatic Skill Networks</title>
      <itunes:episode>1570</itunes:episode>
      <podcast:episode>1570</podcast:episode>
      <itunes:title>Evolving Programmatic Skill Networks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">713c5a4e-5c50-45ad-b422-5d59a346745e</guid>
      <link>https://share.transistor.fm/s/0e1ef00c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Haochen Shi, Xingdi Yuan, Bang Liu</p>

            <p><strong>Title:</strong><br>
            Evolving Programmatic Skill Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03509v1">http://arxiv.org/abs/2601.03509v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. (We plan to open-source the code.)</p>
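
            <p>A small sketch of the maturity-aware update gating idea from mechanism (2): a skill with enough successful executions is treated as mature and frozen against further edits, while uncertain skills stay plastic (the statistics, thresholds, and class layout are invented for illustration):</p>

            <pre><code>class Skill:
    """One node of a programmatic skill network: an executable program plus
    simple success statistics used to gate further updates."""
    def __init__(self, name, program):
        self.name, self.program = name, program
        self.successes, self.attempts = 0, 0

    def record(self, success):
        self.attempts += 1
        self.successes += int(success)

    def is_mature(self, min_attempts=10, min_rate=0.9):
        # Mature = enough evidence plus a high success rate; such skills are stabilized.
        if self.attempts < min_attempts:
            return False
        return self.successes / self.attempts >= min_rate

def maybe_update(skill, revised_program):
    """Apply a proposed revision only while the skill is still plastic."""
    if not skill.is_mature():
        skill.program = revised_program

chop = Skill("chop_tree", program="def chop_tree(agent): return agent.swing_axe()")
for _ in range(12):
    chop.record(True)
maybe_update(chop, revised_program="def chop_tree(agent): pass")
print(chop.is_mature(), chop.program)  # True, original program kept
</code></pre>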
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Haochen Shi, Xingdi Yuan, Bang Liu</p>

            <p><strong>Title:</strong><br>
            Evolving Programmatic Skill Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03509v1">http://arxiv.org/abs/2601.03509v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. (We plan to open-source the code.)</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 Jan 2026 19:17:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0e1ef00c/90f1ec92.mp3" length="24616121" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1535</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Haochen Shi, Xingdi Yuan, Bang Liu</p>

            <p><strong>Title:</strong><br>
            Evolving Programmatic Skill Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03509v1">http://arxiv.org/abs/2601.03509v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study continual skill acquisition in open-ended embodied environments where an agent must construct, refine, and reuse an expanding library of executable skills. We introduce the Programmatic Skill Network (PSN), a framework in which skills are executable symbolic programs forming a compositional network that evolves through experience. PSN defines three core mechanisms instantiated via large language models: (1) REFLECT for structured fault localization over skill compositions, (2) progressive optimization with maturity-aware update gating that stabilizes reliable skills while maintaining plasticity for uncertain ones, and (3) canonical structural refactoring under rollback validation that maintains network compactness. We further show that PSN's learning dynamics exhibit structural parallels to neural network training. Experiments on MineDojo and Crafter demonstrate robust skill reuse, rapid adaptation, and strong generalization across open-ended task distributions. (We plan to open-source the code.)</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning</title>
      <itunes:episode>1569</itunes:episode>
      <podcast:episode>1569</podcast:episode>
      <itunes:title>Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0cc87251-11ad-4f55-8a86-edebe0e18440</guid>
      <link>https://share.transistor.fm/s/2fbd4047</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03872v1">http://arxiv.org/abs/2601.03872v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) <strong>training-free cluster-based routing</strong> that exploits empirical priors for domain-specific alignment, and (2) <strong>RL-based multi-step routing</strong> that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.</p>
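
            <p>A toy sketch of what the training-free cluster-based routing path could look like: embed the query, find the nearest task cluster, and pick the model-tool pair with the best empirical prior for that cluster (the model names, tools, and success rates below are hypothetical):</p>

            <pre><code>import numpy as np

def route(query_emb, cluster_centroids, cluster_priors):
    """cluster_centroids: (num_clusters, dim); cluster_priors: per-cluster dict
    mapping (model, tool) pairs to historical success rates."""
    sims = cluster_centroids @ query_emb / (
        np.linalg.norm(cluster_centroids, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    cluster = int(np.argmax(sims))
    # Return the model-tool pair with the highest empirical prior for this cluster.
    return max(cluster_priors[cluster].items(), key=lambda kv: kv[1])[0]

centroids = np.eye(3)                       # 3 toy clusters in a 3-d embedding space
priors = {0: {("small-llm", "calculator"): 0.84, ("large-llm", "none"): 0.71},
          1: {("vision-llm", "ocr"): 0.90},
          2: {("large-llm", "search"): 0.77}}
print(route(np.array([0.9, 0.1, 0.0]), centroids, priors))  # ('small-llm', 'calculator')
</code></pre>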
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03872v1">http://arxiv.org/abs/2601.03872v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) <strong>training-free cluster-based routing</strong> that exploits empirical priors for domain-specific alignment, and (2) <strong>RL-based multi-step routing</strong> that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 Jan 2026 19:16:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2fbd4047/e52f8ba6.mp3" length="26427606" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1648</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03872v1">http://arxiv.org/abs/2601.03872v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) <strong>training-free cluster-based routing</strong> that exploits empirical priors for domain-specific alignment, and (2) <strong>RL-based multi-step routing</strong> that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Benchmark^2: Systematic Evaluation of LLM Benchmarks</title>
      <itunes:episode>1568</itunes:episode>
      <podcast:episode>1568</podcast:episode>
      <itunes:title>Benchmark^2: Systematic Evaluation of LLM Benchmarks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">775f6d02-c664-491b-a9d7-734940e70172</guid>
      <link>https://share.transistor.fm/s/55a3b35d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng</p>

            <p><strong>Title:</strong><br>
            Benchmark^2: Systematic Evaluation of LLM Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03986v1">http://arxiv.org/abs/2601.03986v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng</p>

            <p><strong>Title:</strong><br>
            Benchmark^2: Systematic Evaluation of LLM Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03986v1">http://arxiv.org/abs/2601.03986v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 Jan 2026 19:16:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/55a3b35d/2c20ae3e.mp3" length="21410808" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng</p>

            <p><strong>Title:</strong><br>
            Benchmark^2: Systematic Evaluation of LLM Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03986v1">http://arxiv.org/abs/2601.03986v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields</title>
      <itunes:episode>1567</itunes:episode>
      <podcast:episode>1567</podcast:episode>
      <itunes:title>InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89b85ed3-beb1-4d07-8932-c4b2cf538c58</guid>
      <link>https://share.transistor.fm/s/46bd838f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng</p>

            <p><strong>Title:</strong><br>
            InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03252v1">http://arxiv.org/abs/2601.03252v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng</p>

            <p><strong>Title:</strong><br>
            InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03252v1">http://arxiv.org/abs/2601.03252v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 Jan 2026 19:27:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46bd838f/8944211d.mp3" length="21296330" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng</p>

            <p><strong>Title:</strong><br>
            InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03252v1">http://arxiv.org/abs/2601.03252v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the geometric detail recovery. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LTX-2: Efficient Joint Audio-Visual Foundation Model</title>
      <itunes:episode>1566</itunes:episode>
      <podcast:episode>1566</podcast:episode>
      <itunes:title>LTX-2: Efficient Joint Audio-Visual Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ae080894-b197-4e92-bb2a-1cddbfbf4597</guid>
      <link>https://share.transistor.fm/s/af7f81ec</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman</p>

            <p><strong>Title:</strong><br>
            LTX-2: Efficient Joint Audio-Visual Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03233v1">http://arxiv.org/abs/2601.03233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman</p>

            <p><strong>Title:</strong><br>
            LTX-2: Efficient Joint Audio-Visual Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03233v1">http://arxiv.org/abs/2601.03233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 Jan 2026 19:27:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/af7f81ec/0aa471b3.mp3" length="21608921" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman</p>

            <p><strong>Title:</strong><br>
            LTX-2: Efficient Joint Audio-Visual Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.03233v1">http://arxiv.org/abs/2601.03233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization</title>
      <itunes:episode>1565</itunes:episode>
      <podcast:episode>1565</podcast:episode>
      <itunes:title>MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7ee96fee-bfc1-4a6c-9782-7f680748503b</guid>
      <link>https://share.transistor.fm/s/7e5684e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01554v2">http://arxiv.org/abs/2601.01554v2</a></p>

            <p><strong>Abstract:</strong><br>
            Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01554v2">http://arxiv.org/abs/2601.01554v2</a></p>

            <p><strong>Abstract:</strong><br>
            Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 Jan 2026 19:27:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e5684e2/1e7e1de2.mp3" length="25947357" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1618</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            MOSI. AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01554v2">http://arxiv.org/abs/2601.01554v2</a></p>

            <p><strong>Abstract:</strong><br>
            Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence</title>
      <itunes:episode>1564</itunes:episode>
      <podcast:episode>1564</podcast:episode>
      <itunes:title>SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ead9789a-5c76-439e-a515-d76dbc3f3144</guid>
      <link>https://share.transistor.fm/s/9a7596a5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22334v3">http://arxiv.org/abs/2512.22334v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22334v3">http://arxiv.org/abs/2512.22334v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 Jan 2026 19:26:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9a7596a5/291e86a2.mp3" length="26989756" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1683</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22334v3">http://arxiv.org/abs/2512.22334v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NitroGen: An Open Foundation Model for Generalist Gaming Agents</title>
      <itunes:episode>1563</itunes:episode>
      <podcast:episode>1563</podcast:episode>
      <itunes:title>NitroGen: An Open Foundation Model for Generalist Gaming Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">00ad17dd-e851-4b37-9509-02fb28b1e1e8</guid>
      <link>https://share.transistor.fm/s/7d43ee19</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi "Jim" Fan</p>

            <p><strong>Title:</strong><br>
            NitroGen: An Open Foundation Model for Generalist Gaming Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02427v1">http://arxiv.org/abs/2601.02427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi "Jim" Fan</p>

            <p><strong>Title:</strong><br>
            NitroGen: An Open Foundation Model for Generalist Gaming Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02427v1">http://arxiv.org/abs/2601.02427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 Jan 2026 19:26:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d43ee19/6fcc1e3d.mp3" length="21677059" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Loïc Magne, Anas Awadalla, Guanzhi Wang, Yinzhen Xu, Joshua Belofsky, Fengyuan Hu, Joohwan Kim, Ludwig Schmidt, Georgia Gkioxari, Jan Kautz, Yisong Yue, Yejin Choi, Yuke Zhu, Linxi "Jim" Fan</p>

            <p><strong>Title:</strong><br>
            NitroGen: An Open Foundation Model for Generalist Gaming Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02427v1">http://arxiv.org/abs/2601.02427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NitroGen, a vision-action foundation model for generalist gaming agents that is trained on 40,000 hours of gameplay videos across more than 1,000 games. We incorporate three key ingredients: 1) an internet-scale video-action dataset constructed by automatically extracting player actions from publicly available gameplay videos, 2) a multi-game benchmark environment that can measure cross-game generalization, and 3) a unified vision-action model trained with large-scale behavior cloning. NitroGen exhibits strong competence across diverse domains, including combat encounters in 3D action games, high-precision control in 2D platformers, and exploration in procedurally generated worlds. It transfers effectively to unseen games, achieving up to 52% relative improvement in task success rates over models trained from scratch. We release the dataset, evaluation suite, and model weights to advance research on generalist embodied agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits</title>
      <itunes:episode>1562</itunes:episode>
      <podcast:episode>1562</podcast:episode>
      <itunes:title>Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4cf5584b-13e0-414d-a2dc-6a04ad1c5753</guid>
      <link>https://share.transistor.fm/s/c5b2bc45</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Di Niu</p>

            <p><strong>Title:</strong><br>
            Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20578v2">http://arxiv.org/abs/2512.20578v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Di Niu</p>

            <p><strong>Title:</strong><br>
            Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20578v2">http://arxiv.org/abs/2512.20578v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:45:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5b2bc45/80337250.mp3" length="22113837" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1378</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Di Niu</p>

            <p><strong>Title:</strong><br>
            Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20578v2">http://arxiv.org/abs/2512.20578v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation</title>
      <itunes:episode>1561</itunes:episode>
      <podcast:episode>1561</podcast:episode>
      <itunes:title>NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ceb0c98-1f40-4443-aa01-e41cfb11cba5</guid>
      <link>https://share.transistor.fm/s/212557af</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02204v1">http://arxiv.org/abs/2601.02204v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02204v1">http://arxiv.org/abs/2601.02204v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:44:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/212557af/8fe86c6f.mp3" length="25893037" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1615</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huichao Zhang, Liao Qu, Yiheng Liu, Hang Chen, Yangyang Song, Yongsheng Dong, Shikun Sun, Xian Li, Xu Wang, Yi Jiang, Hu Ye, Bo Chen, Yiming Gao, Peng Liu, Akide Liu, Zhipeng Yang, Qili Deng, Linjie Xing, Jiyang Liu, Zhao Wang, Yang Zhou, Mingcong Liu, Yi Zhang, Qian He, Xiwei Hu, Zhongqi Qi, Jie Shao, Zhiye Fu, Shuai Wang, Fangmin Chen, Xuezhi Chai, Zhihua Wu, Yitong Wang, Zehuan Yuan, Daniel K. Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02204v1">http://arxiv.org/abs/2601.02204v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer</title>
      <itunes:episode>1560</itunes:episode>
      <podcast:episode>1560</podcast:episode>
      <itunes:title>DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d1782539-43d0-428b-9beb-2e7861b75388</guid>
      <link>https://share.transistor.fm/s/02d13e5a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He</p>

            <p><strong>Title:</strong><br>
            DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01425v1">http://arxiv.org/abs/2601.01425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute fidelity while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, and can be seamlessly adapted to various swap-related tasks.</p>
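
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of how bidirectional ID quadruplets could be assembled from an image face-swap model to supervise a video face-swap model. The <code>ifs_swap</code> callable and the tuple layout are assumptions for illustration, not the SyncID-Pipe implementation.</p>

            <pre><code>
# Toy assembly of bidirectional ID quadruplets from an image face-swap (IFS)
# model; ifs_swap and the tuple layout are assumptions, not SyncID-Pipe.
def build_id_quadruplets(ifs_swap, video_a, video_b):
    """ifs_swap(frame, identity_frame) pastes the identity onto frame."""
    quadruplets = []
    for frame_a, frame_b in zip(video_a, video_b):
        a_on_b = ifs_swap(frame_b, frame_a)   # identity A placed on B's frame
        b_on_a = ifs_swap(frame_a, frame_b)   # identity B placed on A's frame
        # (target frame, identity reference, pseudo ground truth, reverse pair)
        quadruplets.append((frame_b, frame_a, a_on_b, b_on_a))
    return quadruplets
            </code></pre>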
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He</p>

            <p><strong>Title:</strong><br>
            DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01425v1">http://arxiv.org/abs/2601.01425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute fidelity while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, and can be seamlessly adapted to various swap-related tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:44:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/02d13e5a/1b88d9b0.mp3" length="22884997" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1427</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He</p>

            <p><strong>Title:</strong><br>
            DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.01425v1">http://arxiv.org/abs/2601.01425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute fidelity while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, and can be seamlessly adapted to various swap-related tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation</title>
      <itunes:episode>1559</itunes:episode>
      <podcast:episode>1559</podcast:episode>
      <itunes:title>VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4a67b909-3254-4719-8dc0-8c47e0ae29f1</guid>
      <link>https://share.transistor.fm/s/232cea26</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia</p>

            <p><strong>Title:</strong><br>
            VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02256v1">http://arxiv.org/abs/2601.02256v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.</p>
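
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of group-relative advantages extended with an intermediate reward and per-step reweighting, the two ingredients the abstract names for stabilizing RL on VAR-style generation. The exact reward split and weighting scheme here are illustrative assumptions, not the paper's formulation.</p>

            <pre><code>
# Sketch: group-relative advantages with an intermediate reward and per-step
# reweighting for multi-scale generation; not the paper's exact formulation.
import numpy as np

def grpo_advantages(final_rewards, intermediate_rewards, step_weights):
    """final_rewards: (G,) per rollout; intermediate_rewards: (G, S) per step
    (e.g. scored on partially generated scales); step_weights: (S,) credit
    assignment weights. Returns per-step advantages of shape (G, S)."""
    final_rewards = np.asarray(final_rewards, dtype=np.float64)
    intermediate_rewards = np.asarray(intermediate_rewards, dtype=np.float64)
    step_weights = np.asarray(step_weights, dtype=np.float64)

    total = final_rewards[:, None] + intermediate_rewards     # (G, S)
    mean = total.mean(axis=0)
    std = total.std(axis=0) + 1e-8
    normalized = (total - mean) / std                         # group-relative
    return normalized * step_weights[None, :]                 # reweight steps
            </code></pre>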
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia</p>

            <p><strong>Title:</strong><br>
            VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02256v1">http://arxiv.org/abs/2601.02256v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:44:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/232cea26/1b82cc14.mp3" length="21699659" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shikun Sun, Liao Qu, Huichao Zhang, Yiheng Liu, Yangyang Song, Xian Li, Xu Wang, Yi Jiang, Daniel K. Du, Xinglong Wu, Jia Jia</p>

            <p><strong>Title:</strong><br>
            VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02256v1">http://arxiv.org/abs/2601.02256v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. Our method integrates three synergistic components: 1) a stabilizing intermediate reward to guide early-stage generation; 2) a dynamic time-step reweighting scheme for precise credit assignment; and 3) a novel mask propagation algorithm, derived from principles of Reward Feedback Learning (ReFL), designed to isolate optimization effects both spatially and temporally. Our approach demonstrates significant improvements in sample quality and objective alignment over the vanilla GRPO baseline, enabling robust and effective optimization for VAR models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GARDO: Reinforcing Diffusion Models without Reward Hacking</title>
      <itunes:episode>1558</itunes:episode>
      <podcast:episode>1558</podcast:episode>
      <itunes:title>GARDO: Reinforcing Diffusion Models without Reward Hacking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa9c4d05-741c-4a9c-b9fd-4526062fe1ad</guid>
      <link>https://share.transistor.fm/s/4033b62a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan</p>

            <p><strong>Title:</strong><br>
            GARDO: Reinforcing Diffusion Models without Reward Hacking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24138v1">http://arxiv.org/abs/2512.24138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.</p>
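
            <p><strong>Illustrative sketch:</strong><br>
            A hedged sketch of the two ideas named above: gating the reference-policy penalty so only high-uncertainty samples pay it, and amplifying rewards for samples that are both high-quality and diverse. The uncertainty proxy, thresholds, and bonus form are assumptions, not GARDO's actual recipe.</p>

            <pre><code>
# Sketch of gated regularization plus diversity-aware reward shaping; the
# uncertainty gate, thresholds, and bonus form are illustrative assumptions.
import numpy as np

def gardo_shaped_reward(proxy_reward, uncertainty, diversity, kl_to_ref,
                        kl_coef=0.1, uncertainty_gate=0.7, diversity_bonus=0.2):
    """All inputs are (N,) arrays over a batch of generated samples."""
    proxy_reward = np.asarray(proxy_reward, dtype=np.float64)
    uncertainty = np.asarray(uncertainty, dtype=np.float64)
    diversity = np.asarray(diversity, dtype=np.float64)
    kl_to_ref = np.asarray(kl_to_ref, dtype=np.float64)

    # Gate: only high-uncertainty samples pay the reference-policy penalty.
    gate = (uncertainty >= uncertainty_gate).astype(np.float64)
    penalty = kl_coef * gate * kl_to_ref

    # Diversity bonus: amplify rewards of good samples that are also diverse.
    good = (proxy_reward >= np.median(proxy_reward)).astype(np.float64)
    bonus = diversity_bonus * good * diversity

    return proxy_reward + bonus - penalty
            </code></pre>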
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan</p>

            <p><strong>Title:</strong><br>
            GARDO: Reinforcing Diffusion Models without Reward Hacking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24138v1">http://arxiv.org/abs/2512.24138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:43:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4033b62a/6a565f05.mp3" length="23339696" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1455</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, Ling Pan</p>

            <p><strong>Title:</strong><br>
            GARDO: Reinforcing Diffusion Models without Reward Hacking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24138v1">http://arxiv.org/abs/2512.24138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams</title>
      <itunes:episode>1557</itunes:episode>
      <podcast:episode>1557</podcast:episode>
      <itunes:title>InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0a13c4d1-fda9-4d33-80f4-af2e41eef90e</guid>
      <link>https://share.transistor.fm/s/6c717548</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang</p>

            <p><strong>Title:</strong><br>
            InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02281v1">http://arxiv.org/abs/2601.02281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve impressive geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT</p>
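
            <p><strong>Illustrative sketch:</strong><br>
            A small sketch of a bounded "rolling" KV cache that evicts low-scoring old entries while protecting the most recent frames. The per-entry score and the protect-recent rule are placeholders standing in for the paper's training-free, attention-agnostic pruning strategy.</p>

            <pre><code>
# Sketch of a bounded "rolling memory" KV cache with training-free eviction;
# the per-entry score and the protect-recent rule are assumptions.
from collections import deque

class RollingKVCache:
    def __init__(self, max_entries, keep_recent=8):
        self.max_entries = max_entries
        self.keep_recent = keep_recent
        self.entries = deque()                    # (score, key, value) per frame

    def append(self, score, key, value):
        self.entries.append((score, key, value))
        self._prune()

    def _prune(self):
        while len(self.entries) > self.max_entries:
            recent = list(self.entries)[-self.keep_recent:]     # always kept
            older = list(self.entries)[:-self.keep_recent]
            if not older:                         # nothing old enough to evict
                break
            # Evict the single least useful old entry, keep everything else.
            victim = min(range(len(older)), key=lambda i: older[i][0])
            del older[victim]
            self.entries = deque(older + recent)
            </code></pre>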
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang</p>

            <p><strong>Title:</strong><br>
            InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02281v1">http://arxiv.org/abs/2601.02281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve impressive geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:43:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c717548/ae9a8fa9.mp3" length="24973928" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1557</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang</p>

            <p><strong>Title:</strong><br>
            InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02281v1">http://arxiv.org/abs/2601.02281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve impressive geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VINO: A Unified Visual Generator with Interleaved OmniModal Context</title>
      <itunes:episode>1556</itunes:episode>
      <podcast:episode>1556</podcast:episode>
      <itunes:title>VINO: A Unified Visual Generator with Interleaved OmniModal Context</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8352cdd9-bdca-4762-8079-a835e57f15f2</guid>
      <link>https://share.transistor.fm/s/4df00c93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye</p>

            <p><strong>Title:</strong><br>
            VINO: A Unified Visual Generator with Interleaved OmniModal Context</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02358v1">http://arxiv.org/abs/2601.02358v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.</p>
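
            <p><strong>Illustrative sketch:</strong><br>
            A brief sketch of assembling an interleaved conditioning sequence from mixed text, image, and video inputs before passing it to a diffusion transformer. The encoder callables and the dict-based dispatch are assumptions, not VINO's architecture.</p>

            <pre><code>
# Sketch of interleaved omnimodal conditioning: encode each input in the order
# the user gave it and concatenate into one context sequence for the MMDiT.
# The encoder callables and dict-based dispatch are assumptions.
import torch

def interleave_conditions(inputs, encode_text, encode_image, encode_video):
    """inputs: ordered list like [("text", t), ("image", img), ("video", vid)].
    Each encoder returns an (L_i, D) tensor; output is the (sum L_i, D) context."""
    encoders = {"text": encode_text, "image": encode_image, "video": encode_video}
    chunks = [encoders[kind](payload) for kind, payload in inputs]
    return torch.cat(chunks, dim=0)
            </code></pre>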
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye</p>

            <p><strong>Title:</strong><br>
            VINO: A Unified Visual Generator with Interleaved OmniModal Context</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02358v1">http://arxiv.org/abs/2601.02358v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 Jan 2026 19:43:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4df00c93/f837e83f.mp3" length="22991545" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye</p>

            <p><strong>Title:</strong><br>
            VINO: A Unified Visual Generator with Interleaved OmniModal Context</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.02358v1">http://arxiv.org/abs/2601.02358v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization</title>
      <itunes:episode>1555</itunes:episode>
      <podcast:episode>1555</podcast:episode>
      <itunes:title>Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8e807a08-4253-428b-b0df-3a42f57d4511</guid>
      <link>https://share.transistor.fm/s/3e6102c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24615v1">http://arxiv.org/abs/2512.24615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose <strong>Youtu-Agent</strong>, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a <strong>Workflow</strong> mode for standard tasks and a <strong>Meta-Agent</strong> mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an <strong>Agent Practice</strong> module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an <strong>Agent RL</strong> module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.</p>
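
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of a configuration object that keeps execution environment, toolkits, and context management decoupled so agents can be composed and reused. All field names are illustrative, not the framework's actual schema.</p>

            <pre><code>
# Toy decoupled agent configuration: environment, toolkits, and context policy
# live in separate fields so pieces can be reused or auto-generated. Field
# names are illustrative, not the framework's schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentConfig:
    name: str
    environment: str                              # e.g. "sandboxed-python"
    toolkits: List[str] = field(default_factory=list)
    context: Dict[str, str] = field(default_factory=dict)

    def with_tool(self, tool_name):
        """Clone the config with one extra tool; everything else is reused."""
        return AgentConfig(self.name, self.environment,
                           self.toolkits + [tool_name], dict(self.context))

base = AgentConfig("researcher", "browser", ["web_search"], {"memory": "episodic"})
coder = base.with_tool("python_exec")             # specialized without re-prompting
            </code></pre>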
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24615v1">http://arxiv.org/abs/2512.24615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose <strong>Youtu-Agent</strong>, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a <strong>Workflow</strong> mode for standard tasks and a <strong>Meta-Agent</strong> mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an <strong>Agent Practice</strong> module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an <strong>Agent RL</strong> module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:45:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e6102c1/8ef2d50b.mp3" length="22514682" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuchen Shi, Yuzheng Cai, Siqi Cai, Zihan Xu, Lichao Chen, Yulei Qin, Zhijian Zhou, Xiang Fei, Chaofan Qiu, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Guocan Cai, Yong Mao, Yunsheng Wu, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24615v1">http://arxiv.org/abs/2512.24615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Language Model (LLM) agent frameworks face two significant challenges: high configuration costs and static capabilities. Building a high-quality agent often requires extensive manual effort in tool integration and prompt engineering, while deployed agents struggle to adapt to dynamic environments without expensive fine-tuning. To address these issues, we propose <strong>Youtu-Agent</strong>, a modular framework designed for the automated generation and continuous evolution of LLM agents. Youtu-Agent features a structured configuration system that decouples execution environments, toolkits, and context management, enabling flexible reuse and automated synthesis. We introduce two generation paradigms: a <strong>Workflow</strong> mode for standard tasks and a <strong>Meta-Agent</strong> mode for complex, non-standard requirements, capable of automatically generating tool code, prompts, and configurations. Furthermore, Youtu-Agent establishes a hybrid policy optimization system: (1) an <strong>Agent Practice</strong> module that enables agents to accumulate experience and improve performance through in-context optimization without parameter updates; and (2) an <strong>Agent RL</strong> module that integrates with distributed training frameworks to enable scalable and stable reinforcement learning of any Youtu-Agents in an end-to-end, large-scale manner. Experiments demonstrate that Youtu-Agent achieves state-of-the-art performance on WebWalkerQA (71.47%) and GAIA (72.8%) using open-weight models. Our automated generation pipeline achieves over 81% tool synthesis success rate, while the Practice module improves performance on AIME 2024/2025 by +2.7% and +5.4% respectively. Moreover, our Agent RL training achieves 40% speedup with steady performance improvement on 7B LLMs, enhancing coding/reasoning and searching capabilities respectively up to 35% and 21% on Maths and general/multi-hop QA benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos</title>
      <itunes:episode>1554</itunes:episode>
      <podcast:episode>1554</podcast:episode>
      <itunes:title>NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1b075c21-9075-4397-b956-a9f6a77c1d5d</guid>
      <link>https://share.transistor.fm/s/936cd115</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00393v1">http://arxiv.org/abs/2601.00393v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io</p>
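
            <p><strong>Illustrative sketch:</strong><br>
            A rough sketch of online monocular degradation simulation: corrupting clean frames on the fly (noise, exposure drift, mild blur) so training tolerates in-the-wild capture artifacts. The specific degradations and their parameters are assumptions, not NeoVerse's pipeline.</p>

            <pre><code>
# Sketch of online monocular degradation simulation: corrupt clean frames on
# the fly during training. Degradation types and parameters are assumptions.
import numpy as np

def degrade_frame(frame, rng):
    """frame: float array in [0, 1] with shape (H, W, 3)."""
    out = frame.astype(np.float64).copy()
    if rng.random() > 0.5:                        # sensor noise
        out = out + rng.normal(0.0, 0.02, size=out.shape)
    if rng.random() > 0.5:                        # exposure / white-balance drift
        out = out * rng.uniform(0.8, 1.2)
    if rng.random() > 0.5:                        # mild horizontal motion blur
        kernel = np.ones(3) / 3.0
        out = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, out)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
noisy = degrade_frame(np.zeros((32, 32, 3)), rng)
            </code></pre>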
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00393v1">http://arxiv.org/abs/2601.00393v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:45:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/936cd115/99ad78c3.mp3" length="21911539" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1366</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00393v1">http://arxiv.org/abs/2601.00393v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose NeoVerse, a versatile 4D world model that is capable of 4D reconstruction, novel-trajectory video generation, and rich downstream applications. We first identify a common limitation of scalability in current 4D world modeling methods, caused either by expensive and specialized multi-view 4D data or by cumbersome training pre-processing. In contrast, our NeoVerse is built upon a core philosophy that makes the full pipeline scalable to diverse in-the-wild monocular videos. Specifically, NeoVerse features pose-free feed-forward 4D reconstruction, online monocular degradation pattern simulation, and other well-aligned techniques. These designs empower NeoVerse with versatility and generalization to various domains. Meanwhile, NeoVerse achieves state-of-the-art performance in standard reconstruction and generation benchmarks. Our project page is available at https://neoverse-4d.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation</title>
      <itunes:episode>1553</itunes:episode>
      <podcast:episode>1553</podcast:episode>
      <itunes:title>Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f874c29-22df-4b49-a029-9dfb5a642892</guid>
      <link>https://share.transistor.fm/s/ccab322f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CV, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00664v1">http://arxiv.org/abs/2601.00664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion that is preferred over the baseline in more than 80% of comparisons.</p>
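
            <p><strong>Illustrative sketch:</strong><br>
            A short sketch of building preference pairs without labels by dropping the user conditions, so the full-condition generation serves as the chosen sample and the condition-dropped one as the synthetic losing sample. The <code>generate</code> signature is an assumption, not the paper's interface.</p>

            <pre><code>
# Sketch of label-free preference pairs: the full-condition generation is the
# chosen sample, the condition-dropped generation is the synthetic losing one.
# The generate() signature is an assumption, not the paper's interface.
def build_preference_pairs(generate, portraits, user_streams):
    """generate(portrait, user_audio=None, user_motion=None) returns avatar motion."""
    pairs = []
    for portrait, (audio, motion) in zip(portraits, user_streams):
        chosen = generate(portrait, user_audio=audio, user_motion=motion)
        rejected = generate(portrait)             # user conditions dropped
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs                                  # train with a standard DPO loss
            </code></pre>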
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CV, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00664v1">http://arxiv.org/abs/2601.00664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion that is preferred over the baseline in more than 80% of comparisons.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:44:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ccab322f/aebff25e.mp3" length="22105071" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1378</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CV, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00664v1">http://arxiv.org/abs/2601.00664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8X speedup over the baseline, and produces reactive and expressive avatar motion, which is preferred over the baseline in more than 80% of comparisons.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation</title>
      <itunes:episode>1552</itunes:episode>
      <podcast:episode>1552</podcast:episode>
      <itunes:title>Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8c4a3683-e6f6-4ea9-a66a-e97b6f04a8a1</guid>
      <link>https://share.transistor.fm/s/d050ae97</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang</p>

            <p><strong>Title:</strong><br>
            Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24271v1">http://arxiv.org/abs/2512.24271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.</p>
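
            <p><strong>Code sketch:</strong><br>
            A rough, hypothetical reading of pair-wise $\ell_1$ advantage normalization: advantages for rollouts on an (original, edited) video pair are rescaled by the pair's mean absolute advantage. The exact normalizer and baseline below are assumptions, not the paper's definition.</p>
            <pre><code>import numpy as np

def pairwise_l1_normalize(adv_original, adv_edited, eps=1e-8):
    """adv_*: advantages of rollouts on each video of an original/edited pair."""
    pair = np.concatenate([adv_original, adv_edited])
    scale = np.abs(pair).mean() + eps          # L1-style normalizer per pair
    return adv_original / scale, adv_edited / scale

# Toy usage: rewards minus a per-video mean baseline, then pair-normalized.
r_orig, r_edit = np.array([1.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])
a_orig, a_edit = r_orig - r_orig.mean(), r_edit - r_edit.mean()
print(pairwise_l1_normalize(a_orig, a_edit))
</code></pre>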
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang</p>

            <p><strong>Title:</strong><br>
            Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24271v1">http://arxiv.org/abs/2512.24271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:44:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d050ae97/e71ccb32.mp3" length="25837038" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1611</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang</p>

            <p><strong>Title:</strong><br>
            Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24271v1">http://arxiv.org/abs/2512.24271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visually ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning</title>
      <itunes:episode>1551</itunes:episode>
      <podcast:episode>1551</podcast:episode>
      <itunes:title>SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25de38c0-f42b-48b8-8f04-b4193185a0a0</guid>
      <link>https://share.transistor.fm/s/2e3c7394</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24330v1">http://arxiv.org/abs/2512.24330v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.</p>
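
            <p><strong>Code sketch:</strong><br>
            A hypothetical sketch of an interleaved reason-and-act loop over the three tools named above (image search, text search, image crop); the tool stubs and the toy policy are placeholders rather than the SenseNova-MARS model.</p>
            <pre><code>def image_search(query):
    return f"[images for '{query}']"

def text_search(query):
    return f"[passages for '{query}']"

def image_crop(image, box):
    return f"[crop {box} of {image}]"

TOOLS = {"image_search": image_search,
         "text_search": text_search,
         "image_crop": lambda arg: image_crop(*arg)}

def toy_policy(question, history):
    """Stand-in for the VLM: pick the next tool call or give a final answer."""
    if not history:
        return ("image_crop", ("photo.jpg", (0, 0, 256, 256))), None
    if len(history) == 1:
        return ("text_search", question), None
    return None, "final answer based on " + "; ".join(o for _, o in history)

def run_agent(question, max_steps=4):
    history = []
    for _ in range(max_steps):
        call, answer = toy_policy(question, history)
        if answer is not None:
            return answer
        name, arg = call
        history.append((name, TOOLS[name](arg)))
    return "no answer"

print(run_agent("Which landmark is in the cropped region?"))
</code></pre>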
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24330v1">http://arxiv.org/abs/2512.24330v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:44:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e3c7394/bf9d143d.mp3" length="26879427" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1676</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24330v1">http://arxiv.org/abs/2512.24330v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Deep Delta Learning</title>
      <itunes:episode>1550</itunes:episode>
      <podcast:episode>1550</podcast:episode>
      <itunes:title>Deep Delta Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c73e2b8f-e6e4-4b16-bee3-8bf12086e3e9</guid>
      <link>https://share.transistor.fm/s/af8f3296</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu</p>

            <p><strong>Title:</strong><br>
            Deep Delta Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00417v1">http://arxiv.org/abs/2601.00417v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\mathbf{k}(\mathbf{X})$ and a gating scalar $β(\mathbf{X})$. We provide a spectral analysis of this operator, demonstrating that the gate $β(\mathbf{X})$ enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.</p>
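
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of the kind of rank-1, gated residual update described above, assuming the operator takes the form y = (I - beta(X) k(X) k(X)^T) x plus a gated write term, with a unit-norm direction k and a gate beta in (0, 2) so that beta near 0, 1, and 2 recovers identity, projection, and reflection; the layer sizes and parameterizations are illustrative only.</p>
            <pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)     # reflection direction k(X)
        self.to_beta = nn.Linear(dim, 1)    # gating scalar beta(X)
        self.to_v = nn.Linear(dim, dim)     # new features to write

    def forward(self, x):
        k = F.normalize(self.to_k(x), dim=-1)          # unit direction
        beta = 2.0 * torch.sigmoid(self.to_beta(x))    # gate in (0, 2)
        coeff = (x * k).sum(-1, keepdim=True)          # inner product k.x
        erase = beta * coeff * k                       # remove component along k
        write = beta * self.to_v(x)                    # gated injection of new features
        return x - erase + write

x = torch.randn(2, 8)
print(DeltaResidualBlock(8)(x).shape)   # torch.Size([2, 8])
</code></pre>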
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu</p>

            <p><strong>Title:</strong><br>
            Deep Delta Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00417v1">http://arxiv.org/abs/2601.00417v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\mathbf{k}(\mathbf{X})$ and a gating scalar $β(\mathbf{X})$. We provide a spectral analysis of this operator, demonstrating that the gate $β(\mathbf{X})$ enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:43:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/af8f3296/7d050dd5.mp3" length="19804140" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Mengdi Wang, Quanquan Gu</p>

            <p><strong>Title:</strong><br>
            Deep Delta Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00417v1">http://arxiv.org/abs/2601.00417v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions. In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\mathbf{k}(\mathbf{X})$ and a gating scalar $β(\mathbf{X})$. We provide a spectral analysis of this operator, demonstrating that the gate $β(\mathbf{X})$ enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction</title>
      <itunes:episode>1549</itunes:episode>
      <podcast:episode>1549</podcast:episode>
      <itunes:title>AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">09dc0cfc-9b4c-400f-811d-e8123417a3e7</guid>
      <link>https://share.transistor.fm/s/92dad114</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00796v1">http://arxiv.org/abs/2601.00796v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, the lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/</p>
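
            <p><strong>Code sketch:</strong><br>
            For reference, a small illustration of cubic Hermite interpolation, the interpolation family named above for temporally continuous trajectories; the keyframe positions and tangents are toy values, and the paper's curvature regularization is not shown.</p>
            <pre><code>import numpy as np

def cubic_hermite(p0, p1, m0, m1, t):
    """Interpolate on t in [0, 1] with C1 continuity at the keyframes."""
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

p0, p1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])   # keyframe positions
m0, m1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # keyframe tangents
print([cubic_hermite(p0, p1, m0, m1, t).round(3) for t in (0.0, 0.5, 1.0)])
</code></pre>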
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00796v1">http://arxiv.org/abs/2601.00796v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, the lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:43:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92dad114/6d2687c9.mp3" length="22286869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1389</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2601.00796v1">http://arxiv.org/abs/2601.00796v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, the lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Nested Learning: The Illusion of Deep Learning Architectures</title>
      <itunes:episode>1548</itunes:episode>
      <podcast:episode>1548</podcast:episode>
      <itunes:title>Nested Learning: The Illusion of Deep Learning Architectures</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0de440c8-fa00-408a-b4ca-5679af2095e2</guid>
      <link>https://share.transistor.fm/s/68354836</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni</p>

            <p><strong>Title:</strong><br>
            Nested Learning: The Illusion of Deep Learning Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24695v1">http://arxiv.org/abs/2512.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.</p>
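
            <p><strong>Code sketch:</strong><br>
            A toy illustration of contribution (1): SGD with momentum maintains a buffer that compresses the stream of past gradients, which is the associative-memory reading described above; the learning rate, momentum value, and gradients are arbitrary.</p>
            <pre><code>import numpy as np

def sgd_momentum_step(w, grad, memory, lr=0.1, mu=0.9):
    """memory = mu * memory + grad  (a compressed history of gradients),
    then the parameter update reads from that memory."""
    memory = mu * memory + grad
    w = w - lr * memory
    return w, memory

w, memory = np.zeros(3), np.zeros(3)
for step in range(3):
    grad = np.array([1.0, -0.5, 0.25])   # toy gradient
    w, memory = sgd_momentum_step(w, grad, memory)
print(w, memory)
</code></pre>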
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni</p>

            <p><strong>Title:</strong><br>
            Nested Learning: The Illusion of Deep Learning Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24695v1">http://arxiv.org/abs/2512.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 Jan 2026 19:42:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/68354836/24a605a6.mp3" length="22865314" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1425</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni</p>

            <p><strong>Title:</strong><br>
            Nested Learning: The Illusion of Deep Learning Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24695v1">http://arxiv.org/abs/2512.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress, particularly in developing Language Models, there are fundamental challenges and unanswered questions about how such models can continually learn/memorize, self-improve, and find effective solutions. In this paper, we present a new learning paradigm, called Nested Learning (NL), that coherently represents a machine learning model with a set of nested, multi-level, and/or parallel optimization problems, each with its own context flow. Through the lens of NL, existing deep learning methods learn from data by compressing their own context flow, and in-context learning naturally emerges in large models. NL suggests a philosophy to design more expressive learning algorithms with more levels, resulting in higher-order in-context learning and potentially unlocking effective continual learning capabilities. We advocate for NL by presenting three core contributions: (1) Expressive Optimizers: We show that known gradient-based optimizers, such as Adam, SGD with Momentum, etc., are in fact associative memory modules that aim to compress the gradients' information (by gradient descent). Building on this insight, we present other more expressive optimizers with deep memory and/or more powerful learning rules; (2) Self-Modifying Learning Module: Taking advantage of NL's insights on learning algorithms, we present a sequence model that learns how to modify itself by learning its own update algorithm; and (3) Continuum Memory System: We present a new formulation for memory systems that generalizes the traditional viewpoint of long/short-term memory. Combining our self-modifying sequence model with the continuum memory system, we present a continual learning module, called Hope, showing promising results in language modeling, knowledge incorporation, few-shot generalization, continual learning, and long-context reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling</title>
      <itunes:episode>1547</itunes:episode>
      <podcast:episode>1547</podcast:episode>
      <itunes:title>Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c22263a2-5672-4375-ac84-3716b6a5d36e</guid>
      <link>https://share.transistor.fm/s/50894cf1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23959v1">http://arxiv.org/abs/2512.23959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.</p>
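
            <p><strong>Code sketch:</strong><br>
            A hypothetical sketch of a hypergraph-shaped memory in which each hyperedge is a memory unit grouping several primitive facts; the overlap-based retrieval rule below is a placeholder for illustration, not the HGMem mechanism.</p>
            <pre><code>from collections import defaultdict

class HypergraphMemory:
    def __init__(self):
        self.units = []                      # hyperedge id -> set of fact ids
        self.facts = {}                      # fact id -> text
        self.by_fact = defaultdict(set)      # fact id -> hyperedges using it

    def add_fact(self, fid, text):
        self.facts[fid] = text

    def add_unit(self, fact_ids):
        eid = len(self.units)
        self.units.append(set(fact_ids))
        for f in fact_ids:
            self.by_fact[f].add(eid)
        return eid

    def related_units(self, fact_ids):
        """Rank memory units by how many of the query facts they contain."""
        scores = defaultdict(int)
        for f in fact_ids:
            for eid in self.by_fact[f]:
                scores[eid] += 1
        return sorted(scores, key=scores.get, reverse=True)

mem = HypergraphMemory()
for fid, text in [("f1", "A founded B"), ("f2", "B acquired C"), ("f3", "C is in D")]:
    mem.add_fact(fid, text)
mem.add_unit(["f1", "f2"])
mem.add_unit(["f2", "f3"])
print(mem.related_units(["f2", "f3"]))     # unit 1 ranks first
</code></pre>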
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23959v1">http://arxiv.org/abs/2512.23959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 02 Jan 2026 18:56:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/50894cf1/02d98f48.mp3" length="21750655" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23959v1">http://arxiv.org/abs/2512.23959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</title>
      <itunes:episode>1546</itunes:episode>
      <podcast:episode>1546</podcast:episode>
      <itunes:title>Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e51926e8-e1b0-42c5-bb09-a4c461cd5328</guid>
      <link>https://share.transistor.fm/s/6ac38aa6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24617v1">http://arxiv.org/abs/2512.24617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $μ$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69$\%$ average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.</p>
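
            <p><strong>Code sketch:</strong><br>
            A back-of-the-envelope illustration of the compute reallocation described above: with compression ratio R, the concept-level backbone only runs on L/R positions, so at matched total FLOPs its per-position budget can grow. All numbers are made up for illustration and are not the paper's accounting.</p>
            <pre><code>def matched_concept_cost(total_flops, seq_len, token_cost, ratio):
    """Per-position budget left for the concept backbone when the
    token-level layers still see all seq_len positions."""
    token_flops = seq_len * token_cost
    concept_positions = seq_len / ratio
    return (total_flops - token_flops) / concept_positions

L, R = 4096, 4                 # sequence length, tokens per concept
token_cost = 1.0               # arbitrary units per token-level position
budget = 2.0 * L * token_cost  # pretend total budget = 2x the token layers
print(matched_concept_cost(budget, L, token_cost, R))  # 4.0 units per concept
</code></pre>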
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24617v1">http://arxiv.org/abs/2512.24617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $μ$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69$\%$ average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 02 Jan 2026 18:55:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6ac38aa6/19aaf602.mp3" length="24411361" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1522</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwei Qu, Shaowen Wang, Zihao Huang, Kai Hua, Fan Yin, Rui-Jie Zhu, Jundong Zhou, Qiyang Min, Zihao Wang, Yizhi Li, Tianyu Zhang, He Xing, Zheng Zhang, Yuxuan Song, Tianyu Zheng, Zhiyuan Zeng, Chenghua Lin, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24617v1">http://arxiv.org/abs/2512.24617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) apply uniform computation to all tokens, despite language exhibiting highly non-uniform information density. This token-uniform regime wastes capacity on locally predictable spans while under-allocating computation to semantically critical transitions. We propose $\textbf{Dynamic Large Concept Models (DLCM)}$, a hierarchical language modeling framework that learns semantic boundaries from latent representations and shifts computation from tokens to a compressed concept space where reasoning is more efficient. DLCM discovers variable-length concepts end-to-end without relying on predefined linguistic units. Hierarchical compression fundamentally changes scaling behavior. We introduce the first $\textbf{compression-aware scaling law}$, which disentangles token-level capacity, concept-level reasoning capacity, and compression ratio, enabling principled compute allocation under fixed FLOPs. To stably train this heterogeneous architecture, we further develop a $\textbf{decoupled $μ$P parametrization}$ that supports zero-shot hyperparameter transfer across widths and compression regimes. At a practical setting ($R=4$, corresponding to an average of four tokens per concept), DLCM reallocates roughly one-third of inference compute into a higher-capacity reasoning backbone, achieving a $\textbf{+2.69$\%$ average improvement}$ across 12 zero-shot benchmarks under matched inference FLOPs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>mHC: Manifold-Constrained Hyper-Connections</title>
      <itunes:episode>1545</itunes:episode>
      <podcast:episode>1545</podcast:episode>
      <itunes:title>mHC: Manifold-Constrained Hyper-Connections</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">826ca069-f4fa-4497-8256-694dc3c22785</guid>
      <link>https://share.transistor.fm/s/6a3a92b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang</p>

            <p><strong>Title:</strong><br>
            mHC: Manifold-Constrained Hyper-Connections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24880v1">http://arxiv.org/abs/2512.24880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, causing severe training instability, restricting scalability, and incurring notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang</p>

            <p><strong>Title:</strong><br>
            mHC: Manifold-Constrained Hyper-Connections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24880v1">http://arxiv.org/abs/2512.24880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, causing severe training instability, restricting scalability, and incurring notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.</p>
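
            <p>The abstract does not say which manifold mHC uses. Purely as an illustration of the idea, the sketch below constrains a Hyper-Connections-style residual mixing matrix to be approximately doubly stochastic via a few Sinkhorn normalization steps; the identity matrix lies on that manifold, so the plain residual path stays expressible. This is an assumed form for illustration, not the paper's construction.</p>

            <pre><code>import numpy as np

def sinkhorn_project(mix, n_iters=20, eps=1e-9):
    """Push a non-negative mixing matrix toward the doubly stochastic manifold
    (rows and columns each sum to 1). The identity matrix sits on this manifold,
    so a pure residual/identity path remains representable. Assumed form, for
    illustration only."""
    p = np.abs(mix) + eps
    for _ in range(n_iters):
        p = p / p.sum(axis=1, keepdims=True)   # normalize rows
        p = p / p.sum(axis=0, keepdims=True)   # normalize columns
    return p

n, d = 4, 8                                    # n parallel residual streams of width d
streams = np.random.randn(n, d)                # widened (HC-style) residual stream
mix = sinkhorn_project(np.random.rand(n, n))   # constrained stream-mixing weights
mixed = mix @ streams                          # equals `streams` when mix is identity
</code></pre>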
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 Jan 2026 19:17:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a3a92b8/41ef48d2.mp3" length="20172386" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang</p>

            <p><strong>Title:</strong><br>
            mHC: Manifold-Constrained Hyper-Connections</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24880v1">http://arxiv.org/abs/2512.24880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, causing severe training instability, restricting scalability, and incurring notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</title>
      <itunes:episode>1544</itunes:episode>
      <podcast:episode>1544</podcast:episode>
      <itunes:title>Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">141de6ef-0f38-4e85-bf6f-8a2bf6be2e1a</guid>
      <link>https://share.transistor.fm/s/d0e7b6b0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan</p>

            <p><strong>Title:</strong><br>
            Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24618v1">http://arxiv.org/abs/2512.24618v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan</p>

            <p><strong>Title:</strong><br>
            Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24618v1">http://arxiv.org/abs/2512.24618v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 Jan 2026 19:17:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d0e7b6b0/79ce3a5a.mp3" length="27496329" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1715</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan</p>

            <p><strong>Title:</strong><br>
            Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24618v1">http://arxiv.org/abs/2512.24618v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem</title>
      <itunes:episode>1543</itunes:episode>
      <podcast:episode>1543</podcast:episode>
      <itunes:title>Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edcec4b4-4a2d-42d1-a14a-bb1f412cc339</guid>
      <link>https://share.transistor.fm/s/77ecce65</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24873v1">http://arxiv.org/abs/2512.24873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24873v1">http://arxiv.org/abs/2512.24873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.</p>
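
            <p>The abstract describes Interaction-based Policy Alignment (IPA) only as assigning credit over semantic interaction chunks rather than individual tokens. The sketch below is a hedged reading of that idea as a REINFORCE-style surrogate in which every token inherits its chunk's advantage; the function names, chunking rule, and loss form are assumptions, not the paper's exact objective.</p>

            <pre><code>import torch

def ipa_style_loss(token_logprobs, chunk_ids, chunk_advantages):
    """Chunk-level credit assignment (hedged sketch, not the paper's exact IPA):
    every token inherits the advantage of the semantic interaction chunk it
    belongs to, instead of receiving its own token-level credit.
      token_logprobs   : (T,) log-probs of sampled tokens under the policy
      chunk_ids        : (T,) id of the interaction chunk each token belongs to
      chunk_advantages : (C,) one advantage estimate per chunk
    """
    per_token_adv = chunk_advantages[chunk_ids]        # broadcast chunk credit
    return -(per_token_adv * token_logprobs).mean()    # REINFORCE-style surrogate

# Toy trajectory: 6 tokens grouped into 3 interaction chunks (e.g. tool calls).
logp = torch.tensor([-0.2, -0.5, -0.1, -0.7, -0.3, -0.4], requires_grad=True)
chunks = torch.tensor([0, 0, 1, 1, 2, 2])
adv = torch.tensor([0.8, -0.2, 1.1])                   # e.g. from a task verifier
ipa_style_loss(logp, chunks, adv).backward()
</code></pre>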
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 Jan 2026 19:16:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77ecce65/f7b1aaa0.mp3" length="24981912" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1558</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixun Wang, XiaoXiao Xu, Wanhe An, Fangwen Dai, Wei Gao, Yancheng He, Ju Huang, Qiang Ji, Hanqi Jin, Xiaoyang Li, Yang Li, Zhongwen Li, Shirong Lin, Jiashun Liu, Zenan Liu, Tao Luo, Dilxat Muhtar, Yuanbin Qu, Jiaqiang Shi, Qinghui Sun, Yingshui Tan, Hao Tang, Runze Wang, Yi Wang, Zhaoguo Wang, Yanan Wu, Shaopan Xiong, Binchen Xu, Xander Xu, Yuchi Xu, Qipeng Zhang, Xixia Zhang, Haizhou Zhao, Jie Zhao, Shuaibing Zhao, Baihui Zheng, Jianhui Zheng, Suhang Zheng, Yanni Zhu, Mengze Cai, Kerui Cao, Xitong Chen, Yue Dai, Lifan Du, Tao Feng, Tao He, Jin Hu, Yijie Hu, Ziyu Jiang, Cheng Li, Xiang Li, Jing Liang, Chonghuan Liu, ZhenDong Liu, Haodong Mi, Yanhu Mo, Junjia Ni, Shixin Pei, Jingyu Shen, XiaoShuai Song, Cecilia Wang, Chaofan Wang, Kangyu Wang, Pei Wang, Tao Wang, Wei Wang, Ke Xiao, Mingyu Xu, Tiange Xu, Nan Ya, Siran Yang, Jianan Ye, Yaxing Zang, Duo Zhang, Junbo Zhang, Boren Zheng, Wanxi Deng, Ling Pan, Lin Qu, Wenbo Su, Jiamang Wang, Wei Wang, Hu Wei, Minggang Wu, Cheng Yu, Bing Zhao, Zhicheng Zheng, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.24873v1">http://arxiv.org/abs/2512.24873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction</title>
      <itunes:episode>1542</itunes:episode>
      <podcast:episode>1542</podcast:episode>
      <itunes:title>GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3e98ff30-efb9-4c21-80bc-a3bb9c702259</guid>
      <link>https://share.transistor.fm/s/6f523eda</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.25073v1">http://arxiv.org/abs/2512.25073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet these methods struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.25073v1">http://arxiv.org/abs/2512.25073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet these methods struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/</p>
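
            <p>Why outpainting from existing poses preserves geometry: enlarging the image canvas at a fixed camera pose only adds new pixel rays, while every original pixel keeps its original viewing ray. The snippet below shows the standard pinhole-intrinsics bookkeeping for such a canvas expansion; it is generic camera geometry, not GaMO's implementation.</p>

            <pre><code>import numpy as np

def widen_fov_intrinsics(K, pad_x, pad_y):
    """Standard pinhole bookkeeping for outpainting at a fixed pose: padding the
    canvas by (pad_x, pad_y) pixels per side keeps the focal length and merely
    shifts the principal point, so every original pixel keeps its viewing ray.
    Generic geometry, not GaMO's implementation."""
    K_wide = K.copy()
    K_wide[0, 2] += pad_x   # cx moves by the left padding
    K_wide[1, 2] += pad_y   # cy moves by the top padding
    return K_wide

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
K_wide = widen_fov_intrinsics(K, pad_x=160, pad_y=120)  # 640x480 canvas -> 960x720
</code></pre>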
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 Jan 2026 19:16:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6f523eda/5a750f44.mp3" length="21632362" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.25073v1">http://arxiv.org/abs/2512.25073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet these methods struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. The latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss</title>
      <itunes:episode>1541</itunes:episode>
      <podcast:episode>1541</podcast:episode>
      <itunes:title>Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fdf14370-ff70-43aa-b293-971e126f483c</guid>
      <link>https://share.transistor.fm/s/9a7e26d9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23447v1">http://arxiv.org/abs/2512.23447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23447v1">http://arxiv.org/abs/2512.23447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.</p>
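
            <p>A hedged sketch of how the two ERC constraints could be turned into a loss: with n experts, the n × n matrix of expert activations on the (perturbed) router embeddings should be diagonally dominant along both rows and columns. Cross-entropy toward the diagonal is one possible surrogate; the paper may use a different activation summary or ranking objective.</p>

            <pre><code>import torch
import torch.nn.functional as F

def erc_style_loss(activations):
    """Hedged sketch of an expert-router coupling loss on an n x n matrix A,
    where A[i, j] summarizes expert i's internal activation on the (perturbed)
    router embedding of expert j. The abstract's two constraints become:
      rows    -> expert i must respond most strongly to its own proxy token
      columns -> proxy token j must excite expert j more than any other expert
    Cross-entropy toward the diagonal is one surrogate; the paper may differ."""
    n = activations.shape[0]
    targets = torch.arange(n)
    row_term = F.cross_entropy(activations, targets)      # constraint (1)
    col_term = F.cross_entropy(activations.t(), targets)  # constraint (2)
    return row_term + col_term

A = torch.randn(8, 8, requires_grad=True)  # n = 8 experts: cost is n^2, batch-free
erc_style_loss(A).backward()
</code></pre>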
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:51:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9a7e26d9/5bd1d3cc.mp3" length="23880967" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1489</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23447v1">http://arxiv.org/abs/2512.23447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation</title>
      <itunes:episode>1540</itunes:episode>
      <podcast:episode>1540</podcast:episode>
      <itunes:title>LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d3859e60-84a7-478f-a841-40f137a7bb4a</guid>
      <link>https://share.transistor.fm/s/b59cc78e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23576v1">http://arxiv.org/abs/2512.23576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23576v1">http://arxiv.org/abs/2512.23576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:51:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b59cc78e/077d8d3b.mp3" length="22389293" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23576v1">http://arxiv.org/abs/2512.23576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size at 20x lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1-2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Yume-1.5: A Text-Controlled Interactive World Generation Model</title>
      <itunes:episode>1539</itunes:episode>
      <podcast:episode>1539</podcast:episode>
      <itunes:title>Yume-1.5: A Text-Controlled Interactive World Generation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a7e4a82-5f2f-4f64-9114-fb7f14bab16e</guid>
      <link>https://share.transistor.fm/s/b220a888</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume-1.5: A Text-Controlled Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22096v1">http://arxiv.org/abs/2512.22096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume-1.5: A Text-Controlled Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22096v1">http://arxiv.org/abs/2512.22096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:51:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b220a888/ea519b52.mp3" length="24070710" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1501</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume-1.5: A Text-Controlled Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22096v1">http://arxiv.org/abs/2512.22096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents</title>
      <itunes:episode>1538</itunes:episode>
      <podcast:episode>1538</podcast:episode>
      <itunes:title>SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c57a7412-7fc6-4b07-9a00-8aa12051564a</guid>
      <link>https://share.transistor.fm/s/57ad3765</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22322v1">http://arxiv.org/abs/2512.22322v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic reinforcement learning (RL) holds great promise for developing autonomous agents for complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Processing such verbose context, full of irrelevant and noisy history, strains these verification protocols and leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with a dual mission: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking yields efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22322v1">http://arxiv.org/abs/2512.22322v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents on complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeded. Processing such verbose context, which contains irrelevant, noisy history, strains the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows LLM-driven agents to be trained in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:50:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/57ad3765/bdd4c1aa.mp3" length="23108151" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22322v1">http://arxiv.org/abs/2512.22322v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents on complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (e.g., a rule-based scoring script, a reward or critic model, or an LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine whether the agent succeeded. Processing such verbose context, which contains irrelevant, noisy history, strains the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: not only to complete a task but also to prove its accomplishment with curated snapshot evidence. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its access to the online environment to perform self-verification on a minimal, decisive set of snapshots. This evidence is provided as the sole material for a general LLM-as-a-Judge verifier to determine its validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows LLM-driven agents to be trained in a scalable manner, bringing performance gains of up to 26.08% and 16.66% to 8B and 30B models, respectively. The synergy between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation</title>
      <itunes:episode>1537</itunes:episode>
      <podcast:episode>1537</podcast:episode>
      <itunes:title>Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8406abda-29ab-433e-b8c5-aee6d852d322</guid>
      <link>https://share.transistor.fm/s/184e5d39</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23705v1">http://arxiv.org/abs/2512.23705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23705v1">http://arxiv.org/abs/2512.23705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:50:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/184e5d39/a55cd9a8.mp3" length="24565202" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1532</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23705v1">http://arxiv.org/abs/2512.23705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion</title>
      <itunes:episode>1536</itunes:episode>
      <podcast:episode>1536</podcast:episode>
      <itunes:title>Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">683f17f3-35f7-42c2-a016-3cfd9ce7efc1</guid>
      <link>https://share.transistor.fm/s/8c795f12</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23709v1">http://arxiv.org/abs/2512.23709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23709v1">http://arxiv.org/abs/2512.23709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:49:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8c795f12/00df2393.mp3" length="24152241" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1506</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.23709v1">http://arxiv.org/abs/2512.23709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Dream-VL &amp; Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone</title>
      <itunes:episode>1535</itunes:episode>
      <podcast:episode>1535</podcast:episode>
      <itunes:title>Dream-VL &amp; Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c769b729-5e82-468e-9576-561e291d1ecc</guid>
      <link>https://share.transistor.fm/s/2979dad5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Dream-VL &amp; Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22615v1">http://arxiv.org/abs/2512.22615v1</a></p>

            <p><strong>Abstract:</strong><br>
            While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Dream-VL &amp; Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22615v1">http://arxiv.org/abs/2512.22615v1</a></p>

            <p><strong>Abstract:</strong><br>
            While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:49:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2979dad5/282b6e13.mp3" length="22910927" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Dream-VL &amp; Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22615v1">http://arxiv.org/abs/2512.22615v1</a></p>

            <p><strong>Abstract:</strong><br>
            While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as π₀ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpotEdit: Selective Region Editing in Diffusion Transformers</title>
      <itunes:episode>1534</itunes:episode>
      <podcast:episode>1534</podcast:episode>
      <itunes:title>SpotEdit: Selective Region Editing in Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">398cdff4-3cfb-42db-836d-13e0655553c4</guid>
      <link>https://share.transistor.fm/s/aa616bdc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            SpotEdit: Selective Region Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22323v1">http://arxiv.org/abs/2512.22323v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            SpotEdit: Selective Region Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22323v1">http://arxiv.org/abs/2512.22323v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:49:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa616bdc/268788ce.mp3" length="21884364" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhibin Qin, Zhenxiong Tan, Zeqing Wang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            SpotEdit: Selective Region Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22323v1">http://arxiv.org/abs/2512.22323v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models</title>
      <itunes:episode>1533</itunes:episode>
      <podcast:episode>1533</podcast:episode>
      <itunes:title>GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37ad4b02-8666-470d-beb2-34d3101498dc</guid>
      <link>https://share.transistor.fm/s/82050f70</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang</p>

            <p><strong>Title:</strong><br>
            GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15560v2">http://arxiv.org/abs/2512.15560v2</a></p>

            <p><strong>Abstract:</strong><br>
            The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about <strong>750× faster</strong>. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang</p>

            <p><strong>Title:</strong><br>
            GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15560v2">http://arxiv.org/abs/2512.15560v2</a></p>

            <p><strong>Abstract:</strong><br>
            The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about <strong>750× faster</strong>. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Dec 2025 19:48:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/82050f70/4eebb0bc.mp3" length="21227357" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1323</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang</p>

            <p><strong>Title:</strong><br>
            GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15560v2">http://arxiv.org/abs/2512.15560v2</a></p>

            <p><strong>Abstract:</strong><br>
            The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Notably, under our experimental setup, compared with training a diffusion model from scratch, evaluating with TED-6K is about <strong>750× faster</strong>. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our TED-6K dataset and evaluation code are available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion</title>
      <itunes:episode>1532</itunes:episode>
      <podcast:episode>1532</podcast:episode>
      <itunes:title>InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9b5634f5-91cd-4be4-99a4-c554c1f4cd95</guid>
      <link>https://share.transistor.fm/s/e9e9e701</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 74 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17504v1">http://arxiv.org/abs/2512.17504v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 74 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17504v1">http://arxiv.org/abs/2512.17504v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Dec 2025 19:07:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9e9e701/9cd386f4.mp3" length="22322843" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 74 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hoiyeong Jin, Hyojin Jang, Jeongho Kim, Junha Hyung, Kinam Kim, Dongjin Kim, Huijin Choi, Hyeonji Kim, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17504v1">http://arxiv.org/abs/2512.17504v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object-removed video, object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding</title>
      <itunes:episode>1531</itunes:episode>
      <podcast:episode>1531</podcast:episode>
      <itunes:title>Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9077ed3c-c0a0-4aaf-8b4d-c8443ea8c0e3</guid>
      <link>https://share.transistor.fm/s/e8800a96</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17220v1">http://arxiv.org/abs/2512.17220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17220v1">http://arxiv.org/abs/2512.17220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Dec 2025 19:07:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e8800a96/ea26f2d6.mp3" length="20485063" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1277</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqing Li, Jiangnan Li, Zheng Lin, Ziyan Zhou, Junjie Wu, Weiping Wang, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17220v1">http://arxiv.org/abs/2512.17220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MAI-UI Technical Report: Real-World Centric Foundation GUI Agents</title>
      <itunes:episode>1530</itunes:episode>
      <podcast:episode>1530</podcast:episode>
      <itunes:title>MAI-UI Technical Report: Real-World Centric Foundation GUI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c5d0ca79-b91a-4524-b096-f4ea60d09d9f</guid>
      <link>https://share.transistor.fm/s/94147c40</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            MAI-UI Technical Report: Real-World Centric Foundation GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22047v1">http://arxiv.org/abs/2512.22047v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro-based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            MAI-UI Technical Report: Real-World Centric Foundation GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22047v1">http://arxiv.org/abs/2512.22047v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Dec 2025 19:06:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94147c40/27661f9c.mp3" length="24042710" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1499</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            MAI-UI Technical Report: Real-World Centric Foundation GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.22047v1">http://arxiv.org/abs/2512.22047v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Latent Implicit Visual Reasoning</title>
      <itunes:episode>1529</itunes:episode>
      <podcast:episode>1529</podcast:episode>
      <itunes:title>Latent Implicit Visual Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">29d4dfe5-c755-40f2-bc01-a441c9a2a4a2</guid>
      <link>https://share.transistor.fm/s/1e31fddf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig</p>

            <p><strong>Title:</strong><br>
            Latent Implicit Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21218v1">http://arxiv.org/abs/2512.21218v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.</p>
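
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The visual reasoning tokens described above can be pictured as a handful of learnable vectors that attend globally over the image tokens and re-encode them in a task-adaptive way. The NumPy sketch below shows that single cross-attention step; the dimensions, initialization, and single-head formulation are illustrative assumptions, not the paper's architecture:</p>

            <pre><code>
            import numpy as np

            def softmax(x, axis=-1):
                x = x - x.max(axis=axis, keepdims=True)
                e = np.exp(x)
                return e / e.sum(axis=axis, keepdims=True)

            rng = np.random.default_rng(0)
            d, n_img, n_reason = 64, 196, 8   # hidden size, image tokens, reasoning tokens

            image_tokens = rng.normal(size=(n_img, d))              # from the vision encoder
            reason_tokens = rng.normal(size=(n_reason, d)) * 0.02   # learnable, no explicit supervision

            # One global cross-attention step: reasoning tokens query every image token
            # and produce a compact, task-adaptive re-encoding of the image.
            Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
            q = reason_tokens @ Wq
            k = image_tokens @ Wk
            v = image_tokens @ Wv
            attn = softmax(q @ k.T / np.sqrt(d))   # (n_reason, n_img), global attention
            latent_visual_summary = attn @ v       # (n_reason, d)

            # The LMM would consume [text tokens, image tokens, latent_visual_summary];
            # the training signal comes only from the final answer, not from helper images.
            print(latent_visual_summary.shape)
            </code></pre>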
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig</p>

            <p><strong>Title:</strong><br>
            Latent Implicit Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21218v1">http://arxiv.org/abs/2512.21218v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Dec 2025 18:58:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1e31fddf/0ae233b6.mp3" length="24845995" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1549</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig</p>

            <p><strong>Title:</strong><br>
            Latent Implicit Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21218v1">http://arxiv.org/abs/2512.21218v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning</title>
      <itunes:episode>1528</itunes:episode>
      <podcast:episode>1528</podcast:episode>
      <itunes:title>Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8a760d78-27ec-4992-92a4-98068d03be67</guid>
      <link>https://share.transistor.fm/s/2f3c3fa0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento</p>

            <p><strong>Title:</strong><br>
            Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20605v2">http://arxiv.org/abs/2512.20605v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.</p>
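
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A rough way to picture "acting and exploring within the internal representations" is a small library of controller vectors that are added to a frozen base model's residual stream and come with a learned termination head. Everything below (shapes, the toy state update, the stopping rule) is an illustrative assumption rather than the paper's architecture:</p>

            <pre><code>
            import numpy as np

            rng = np.random.default_rng(1)
            d_hidden, n_actions, n_controllers = 32, 4, 6

            # Frozen base autoregressive policy: hidden state -> action logits.
            W_out = rng.normal(size=(d_hidden, n_actions))

            # Hypothetical library of internal controllers discovered by the higher-order model:
            # each is a vector added to the residual stream, plus a shared termination head.
            controller_vecs = rng.normal(size=(n_controllers, d_hidden))
            W_term = rng.normal(size=(d_hidden,))

            def softmax(x):
                e = np.exp(x - x.max())
                return e / e.sum()

            def rollout_with_controller(h, c_idx, max_steps=10):
                """Run the base policy while controller c_idx steers its residual stream."""
                actions = []
                for _ in range(max_steps):
                    h_ctrl = h + controller_vecs[c_idx]          # act in representation space
                    a = int(np.argmax(softmax(h_ctrl @ W_out)))  # base model decodes an action
                    actions.append(a)
                    h = np.tanh(h + 0.1 * rng.normal(size=d_hidden))    # stand-in for env/state update
                    p_stop = 1.0 / (1.0 + np.exp(-(h_ctrl @ W_term)))   # learned termination condition
                    if p_stop > 0.5:
                        break
                return actions

            # "Internal RL" would reinforce the choice of c_idx from sparse task reward,
            # instead of reinforcing every low-level token.
            print(rollout_with_controller(rng.normal(size=d_hidden), c_idx=2))
            </code></pre>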
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento</p>

            <p><strong>Title:</strong><br>
            Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20605v2">http://arxiv.org/abs/2512.20605v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Dec 2025 18:57:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2f3c3fa0/a54db890.mp3" length="25036232" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1561</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento</p>

            <p><strong>Title:</strong><br>
            Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20605v2">http://arxiv.org/abs/2512.20605v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times</title>
      <itunes:episode>1527</itunes:episode>
      <podcast:episode>1527</podcast:episode>
      <itunes:title>TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ec69607c-e7ee-4943-a2a3-526ada29f1c2</guid>
      <link>https://share.transistor.fm/s/e5b27b8f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16093v1">http://arxiv.org/abs/2512.16093v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations.   We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.</p>
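
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Of the listed components, W8A8 quantization is the easiest to show in a few lines: both the weights and the activations of a linear layer are stored as int8 with a scale, and the integer matmul is rescaled back to float. The NumPy sketch below uses simple symmetric per-tensor quantization, which is an assumption; TurboDiffusion's exact scheme may differ:</p>

            <pre><code>
            import numpy as np

            def quantize_sym_int8(x):
                """Symmetric per-tensor int8 quantization: x is approximately scale * q."""
                scale = np.abs(x).max() / 127.0 + 1e-12
                q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
                return q, scale

            rng = np.random.default_rng(0)
            x = rng.normal(size=(8, 256)).astype(np.float32)    # activations entering a linear layer
            w = rng.normal(size=(256, 512)).astype(np.float32)  # weights of that linear layer

            qx, sx = quantize_sym_int8(x)   # A8: activations quantized on the fly
            qw, sw = quantize_sym_int8(w)   # W8: weights quantized once and stored as int8

            # int8 x int8 matmul accumulated in int32, then rescaled back to float.
            y_int8 = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
            y_fp32 = x @ w

            rel_err = np.linalg.norm(y_int8 - y_fp32) / np.linalg.norm(y_fp32)
            print(f"relative error of the W8A8 linear layer: {rel_err:.3f}")
            </code></pre>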
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16093v1">http://arxiv.org/abs/2512.16093v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations.   We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Dec 2025 19:11:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e5b27b8f/f0235646.mp3" length="20567383" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16093v1">http://arxiv.org/abs/2512.16093v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations.   We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models</title>
      <itunes:episode>1526</itunes:episode>
      <podcast:episode>1526</podcast:episode>
      <itunes:title>Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d81175b6-1710-405a-a183-2fbc23136bd7</guid>
      <link>https://share.transistor.fm/s/bb51afba</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20557v1">http://arxiv.org/abs/2512.20557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model aspects, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs; it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.</p>
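
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The automated pipeline turns extracted geometric cues into multiple-choice questions. The toy example below does this for one hypothetical cue (3D object trajectories); the trajectories, coordinate convention, and question template are all our own assumptions, not DSR Suite's actual templates:</p>

            <pre><code>
            import numpy as np

            # Toy stand-ins for pipeline outputs on one clip: 3D object trajectories in camera
            # coordinates, shape (timesteps, 3). A real pipeline would obtain these from
            # tracking plus depth/pose models.
            traj_cup = np.stack([np.linspace(2.0, 0.5, 8), np.zeros(8), np.full(8, 1.0)], axis=1)
            traj_book = np.stack([np.linspace(1.0, 1.2, 8), np.zeros(8), np.full(8, 2.0)], axis=1)

            def displacement_toward_camera(traj):
                # Assume the camera sits at the origin looking down +x in this toy frame.
                return traj[0, 0] - traj[-1, 0]   # positive means the object got closer

            objects = {"cup": traj_cup, "book": traj_book}
            closer = max(objects, key=lambda k: displacement_toward_camera(objects[k]))

            question = "Which object moves closer to the camera over the clip?"
            choices = ["cup", "book", "both", "neither"]
            print(question, choices, "answer:", closer)
            </code></pre>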
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20557v1">http://arxiv.org/abs/2512.20557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model aspects, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs; it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Dec 2025 19:10:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bb51afba/a4893ed2.mp3" length="22082081" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1376</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20557v1">http://arxiv.org/abs/2512.20557v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across the dataset, benchmark, and model aspects, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs; it condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation</title>
      <itunes:episode>1525</itunes:episode>
      <podcast:episode>1525</podcast:episode>
      <itunes:title>DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">62e6e3ce-262d-43db-ae14-0d6657dd33ba</guid>
      <link>https://share.transistor.fm/s/dfe3b47c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21252v1">http://arxiv.org/abs/2512.21252v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21252v1">http://arxiv.org/abs/2512.21252v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Dec 2025 19:10:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dfe3b47c/a101d46b.mp3" length="20785968" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1295</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21252v1">http://arxiv.org/abs/2512.21252v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</title>
      <itunes:episode>1524</itunes:episode>
      <podcast:episode>1524</podcast:episode>
      <itunes:title>T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">59260bc9-2481-431c-b5c0-da0e43eeb88a</guid>
      <link>https://share.transistor.fm/s/a5b60ece</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21094v1">http://arxiv.org/abs/2512.21094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.</p>
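
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The dual-level evaluation combines objective signal-level metrics with subjective MLLM-as-a-Judge ratings. A minimal way to aggregate the two levels is a weighted average like the one below; the metric names, rating scale, and equal weights are assumptions, not the benchmark's published formula:</p>

            <pre><code>
            # Toy aggregation of the two evaluation levels described in the abstract:
            # objective signal-level metrics plus MLLM-as-a-Judge ratings (all names illustrative).

            def t2av_score(objective, judge, w_objective=0.5, w_judge=0.5):
                """objective: dict of metrics already normalized to [0, 1]; judge: dict of 1-5 ratings."""
                obj = sum(objective.values()) / len(objective)
                subj = sum((r - 1) / 4.0 for r in judge.values()) / len(judge)   # map 1-5 to [0, 1]
                return w_objective * obj + w_judge * subj

            example = t2av_score(
                objective={"video_quality": 0.71, "audio_quality": 0.64, "av_alignment": 0.58},
                judge={"instruction_following": 3, "realism": 2},
            )
            print(round(example, 3))
            </code></pre>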
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21094v1">http://arxiv.org/abs/2512.21094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Dec 2025 19:10:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5b60ece/f10701a7.mp3" length="20049956" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1249</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.21094v1">http://arxiv.org/abs/2512.21094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SemanticGen: Video Generation in Semantic Space</title>
      <itunes:episode>1523</itunes:episode>
      <podcast:episode>1523</podcast:episode>
      <itunes:title>SemanticGen: Video Generation in Semantic Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9b39a44a-0388-4198-83b4-844eb57d7902</guid>
      <link>https://share.transistor.fm/s/bd3fd274</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Xiaoshi Wu, Xintao Wang, Xiao Fu, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai</p>

            <p><strong>Title:</strong><br>
            SemanticGen: Video Generation in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20619v2">http://arxiv.org/abs/2512.20619v2</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.</p>
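
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The two-stage design first plans the video in a compact semantic space and then fills in detail as VAE latents conditioned on that plan. The stand-in functions below only mimic the data flow; the real stages are diffusion models, and all dimensions here are made up:</p>

            <pre><code>
            import numpy as np

            rng = np.random.default_rng(0)
            T, d_sem, d_vae = 16, 32, 256   # frames, compact semantic dim, detailed VAE latent dim

            def stage1_semantic_planner(prompt_seed):
                """Stand-in for the first diffusion model: plan the video in a compact semantic space."""
                rng1 = np.random.default_rng(prompt_seed)
                base = rng1.normal(size=d_sem)
                drift = rng1.normal(size=d_sem) * 0.1
                return np.stack([base + t * drift for t in range(T)])   # (T, d_sem) global layout

            def stage2_detail_decoder(semantics):
                """Stand-in for the second diffusion model: add high-frequency detail as VAE latents."""
                lift = np.random.default_rng(1).normal(size=(d_sem, d_vae)) / np.sqrt(d_sem)
                return semantics @ lift + 0.05 * rng.normal(size=(T, d_vae))   # conditioned on stage 1

            semantic_plan = stage1_semantic_planner(prompt_seed=42)   # cheap: 8x smaller than the latents here
            vae_latents = stage2_detail_decoder(semantic_plan)        # these would go to the VAE decoder
            print(semantic_plan.shape, vae_latents.shape)
            </code></pre>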
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Xiaoshi Wu, Xintao Wang, Xiao Fu, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai</p>

            <p><strong>Title:</strong><br>
            SemanticGen: Video Generation in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20619v2">http://arxiv.org/abs/2512.20619v2</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Dec 2025 19:13:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd3fd274/2dbed30f.mp3" length="21385726" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Xiaoshi Wu, Xintao Wang, Xiao Fu, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai</p>

            <p><strong>Title:</strong><br>
            SemanticGen: Video Generation in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20619v2">http://arxiv.org/abs/2512.20619v2</a></p>

            <p><strong>Abstract:</strong><br>
            State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies</title>
      <itunes:episode>1522</itunes:episode>
      <podcast:episode>1522</podcast:episode>
      <itunes:title>Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f03e7bc-7b91-4fd2-b107-14ef9d46a2fd</guid>
      <link>https://share.transistor.fm/s/6c6b3ede</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19673v1">http://arxiv.org/abs/2512.19673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of these internal policies, we find that: (a) Early layers keep high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.</p>
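
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The decomposition rests on the fact that any intermediate hidden state can be pushed through the unembedding matrix to get a samplable "internal layer policy", whose entropy can then be tracked across depth (a logit-lens-style view). The toy residual stream below is random, so only the mechanics are meaningful, not the numbers:</p>

            <pre><code>
            import numpy as np

            def softmax(x):
                x = x - x.max()
                e = np.exp(x)
                return e / e.sum()

            def entropy(p):
                return float(-(p * np.log(p + 1e-12)).sum())

            rng = np.random.default_rng(0)
            n_layers, d_model, vocab = 12, 64, 1000

            # Toy residual stream for the token being predicted: each layer adds its contribution,
            # so the hidden state after layer L is the cumulative sum of per-layer updates.
            layer_updates = rng.normal(size=(n_layers, d_model)) * np.linspace(1.0, 0.2, n_layers)[:, None]
            hidden_after_layer = np.cumsum(layer_updates, axis=0)

            W_unembed = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)

            # "Internal layer policy": project every intermediate hidden state through the
            # unembedding matrix, exactly as the full policy is read off the final hidden state.
            for layer in range(n_layers):
                pi_layer = softmax(hidden_after_layer[layer] @ W_unembed)
                print(f"layer {layer:2d}  entropy = {entropy(pi_layer):.2f}")
            # In the paper's analysis, entropy stays high in early layers (exploration) and
            # collapses toward zero in the top layers (refinement); BuPO adds an RL objective
            # directly on these lower-layer policies early in training.
            </code></pre>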
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19673v1">http://arxiv.org/abs/2512.19673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) early layers maintain high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; (b) Llama's prediction space converges rapidly in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Dec 2025 19:12:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c6b3ede/ace2c261.mp3" length="26425941" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1648</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19673v1">http://arxiv.org/abs/2512.19673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) early layers maintain high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; (b) Llama's prediction space converges rapidly in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongVideoAgent: Multi-Agent Reasoning with Long Videos</title>
      <itunes:episode>1521</itunes:episode>
      <podcast:episode>1521</podcast:episode>
      <itunes:title>LongVideoAgent: Multi-Agent Reasoning with Long Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">78fe8110-6c3c-40aa-a1ba-e51860f2d7cc</guid>
      <link>https://share.transistor.fm/s/e5170db7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            LongVideoAgent: Multi-Agent Reasoning with Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20618v1">http://arxiv.org/abs/2512.20618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            LongVideoAgent: Multi-Agent Reasoning with Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20618v1">http://arxiv.org/abs/2512.20618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Dec 2025 19:12:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e5170db7/cf2476da.mp3" length="21364417" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1332</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            LongVideoAgent: Multi-Agent Reasoning with Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20618v1">http://arxiv.org/abs/2512.20618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpatialTree: How Spatial Abilities Branch Out in MLLMs</title>
      <itunes:episode>1520</itunes:episode>
      <podcast:episode>1520</podcast:episode>
      <itunes:title>SpatialTree: How Spatial Abilities Branch Out in MLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">903077bc-3b50-4303-9483-3ab996229271</guid>
      <link>https://share.transistor.fm/s/e0729662</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            SpatialTree: How Spatial Abilities Branch Out in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20617v1">http://arxiv.org/abs/2512.20617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            SpatialTree: How Spatial Abilities Branch Out in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20617v1">http://arxiv.org/abs/2512.20617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Dec 2025 19:12:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e0729662/c54bc9d7.mp3" length="21341011" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            SpatialTree: How Spatial Abilities Branch Out in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.20617v1">http://arxiv.org/abs/2512.20617v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI</title>
      <itunes:episode>1519</itunes:episode>
      <podcast:episode>1519</podcast:episode>
      <itunes:title>DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b47b697-6518-4faf-9120-5b8c43b7269e</guid>
      <link>https://share.transistor.fm/s/70bb209a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 159 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16676v1">http://arxiv.org/abs/2512.16676v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 159 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16676v1">http://arxiv.org/abs/2512.16676v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:28:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/70bb209a/5d84578f.mp3" length="23185945" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 159 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16676v1">http://arxiv.org/abs/2512.16676v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding</title>
      <itunes:episode>1518</itunes:episode>
      <podcast:episode>1518</podcast:episode>
      <itunes:title>The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2f87e462-0691-4f6f-842d-971e87344167</guid>
      <link>https://share.transistor.fm/s/8f331e8d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19693v1">http://arxiv.org/abs/2512.19693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a striking and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much like light dispersed by a prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19693v1">http://arxiv.org/abs/2512.19693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a striking and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much like light dispersed by a prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:28:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f331e8d/383f2f8a.mp3" length="25229324" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1573</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19693v1">http://arxiv.org/abs/2512.19693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a striking and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much like light dispersed by a prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Region-Constraint In-Context Generation for Instructional Video Editing</title>
      <itunes:episode>1517</itunes:episode>
      <podcast:episode>1517</podcast:episode>
      <itunes:title>Region-Constraint In-Context Generation for Instructional Video Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed68882a-a3a1-462f-b2cc-a9ef0e1f182b</guid>
      <link>https://share.transistor.fm/s/66056528</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei</p>

            <p><strong>Title:</strong><br>
            Region-Constraint In-Context Generation for Instructional Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17650v1">http://arxiv.org/abs/2512.17650v1</a></p>

            <p><strong>Abstract:</strong><br>
            The in-context generation paradigm has recently demonstrated strong performance in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that explicitly models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo employs two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing modification of the editing area and suppressing unexpected content generation outside it. The latter suppresses the attention from tokens in the editing region to their counterparts in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to support model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our proposal.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei</p>

            <p><strong>Title:</strong><br>
            Region-Constraint In-Context Generation for Instructional Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17650v1">http://arxiv.org/abs/2512.17650v1</a></p>

            <p><strong>Abstract:</strong><br>
            The in-context generation paradigm has recently demonstrated strong performance in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that explicitly models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo employs two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing modification of the editing area and suppressing unexpected content generation outside it. The latter suppresses the attention from tokens in the editing region to their counterparts in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to support model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our proposal.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:28:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/66056528/07199f91.mp3" length="20336254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1267</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Zhang, Fuchen Long, Wei Li, Zhaofan Qiu, Wu Liu, Ting Yao, Tao Mei</p>

            <p><strong>Title:</strong><br>
            Region-Constraint In-Context Generation for Instructional Video Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17650v1">http://arxiv.org/abs/2512.17650v1</a></p>

            <p><strong>Abstract:</strong><br>
            The in-context generation paradigm has recently demonstrated strong performance in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, adapting such in-context learning to instruction-based video editing is not trivial. Without specified editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that explicitly models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo employs two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing modification of the editing area and suppressing unexpected content generation outside it. The latter suppresses the attention from tokens in the editing region to their counterparts in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, ReCo-Data, comprising 500K instruction-video pairs to support model training. Extensive experiments on four major instruction-based video editing tasks demonstrate the superiority of our proposal.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation</title>
      <itunes:episode>1516</itunes:episode>
      <podcast:episode>1516</podcast:episode>
      <itunes:title>QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c5b5e794-6591-4857-bb4d-289aaa04ce8b</guid>
      <link>https://share.transistor.fm/s/25466266</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19134v1">http://arxiv.org/abs/2512.19134v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5-12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19134v1">http://arxiv.org/abs/2512.19134v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5-12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:27:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/25466266/0e8ef7b4.mp3" length="23321771" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1454</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.19134v1">http://arxiv.org/abs/2512.19134v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5-12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation</title>
      <itunes:episode>1515</itunes:episode>
      <podcast:episode>1515</podcast:episode>
      <itunes:title>Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c2737b9-9820-4742-95a2-116756ff1e12</guid>
      <link>https://share.transistor.fm/s/5488d574</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17040v1">http://arxiv.org/abs/2512.17040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory-video pair datasets, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17040v1">http://arxiv.org/abs/2512.17040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory-video pair datasets, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:27:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5488d574/404dcfba.mp3" length="23757260" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1481</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17040v1">http://arxiv.org/abs/2512.17040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on trajectory-video pair datasets, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction</title>
      <itunes:episode>1514</itunes:episode>
      <podcast:episode>1514</podcast:episode>
      <itunes:title>Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ddc0e663-d207-40c2-a458-b94820f82505</guid>
      <link>https://share.transistor.fm/s/7a34160d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.18880v1">http://arxiv.org/abs/2512.18880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.18880v1">http://arxiv.org/abs/2512.18880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Dec 2025 19:26:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7a34160d/a709ec01.mp3" length="24972311" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1557</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.18880v1">http://arxiv.org/abs/2512.18880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows</title>
      <itunes:episode>1513</itunes:episode>
      <podcast:episode>1513</podcast:episode>
      <itunes:title>Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">934dbb16-e1ef-479f-b5cd-2d5d457ab7ca</guid>
      <link>https://share.transistor.fm/s/6afb3149</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16969v1">http://arxiv.org/abs/2512.16969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16969v1">http://arxiv.org/abs/2512.16969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:50:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6afb3149/07f8d06e.mp3" length="23121961" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu, Jiyao Liu, Cheng Tang, Wei Li, Ying Chen, Junzhi Ning, Pengfei Jiang, Chenglong Ma, Ye Du, Changkai Ji, Huihui Xu, Ming Hu, Jiangbin Zheng, Xin Chen, Yucheng Wu, Feifei Jiang, Xi Chen, Xiangru Tang, Yuchen Fu, Yingzhou Lu, Yuanyuan Zhang, Lihao Sun, Chengbo Li, Jinzhe Ma, Wanhao Liu, Yating Liu, Kuo-Cheng Wu, Shengdu Chai, Yizhou Wang, Ouwen Zhangjin, Chen Tang, Shufei Zhang, Wenbo Cao, Junjie Ren, Taoyong Cui, Zhouheng Yao, Juntao Deng, Yijie Sun, Feng Liu, Wangxu Wei, Jingyi Xu, Zhangrui Li, Junchao Gong, Zijie Guo, Zhiyu Yao, Zaoyu Chen, Tianhao Peng, Fangchen Yu, Bo Zhang, Dongzhan Zhou, Shixiang Tang, Jiaheng Liu, Fenghua Ling, Yan Lu, Yuchen Ren, Ben Fei, Zhen Zhao, Xinyu Gu, Rui Su, Xiao-Ming Wu, Weikang Si, Yang Liu, Hao Chen, Xiangchao Yan, Xue Yang, Junchi Yan, Jiamin Wu, Qihao Zheng, Chenhui Li, Zhiqiang Gao, Hao Kong, Junjun He, Mao Su, Tianfan Fu, Peng Ye, Chunfeng Song, Nanqing Dong, Yuqiang Li, Huazhu Fu, Siqi Sun, Lijing Cheng, Jintai Lin, Wanli Ouyang, Bowen Zhou, Wenlong Zhang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16969v1">http://arxiv.org/abs/2512.16969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10-20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence</title>
      <itunes:episode>1512</itunes:episode>
      <podcast:episode>1512</podcast:episode>
      <itunes:title>PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">90d8212d-b415-4a1d-b4b3-c4f16e2d4c19</guid>
      <link>https://share.transistor.fm/s/a68d4bac</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen</p>

            <p><strong>Title:</strong><br>
            PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16793v1">http://arxiv.org/abs/2512.16793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen</p>

            <p><strong>Title:</strong><br>
            PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16793v1">http://arxiv.org/abs/2512.16793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:50:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a68d4bac/f59656e7.mp3" length="24592358" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1533</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen</p>

            <p><strong>Title:</strong><br>
            PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16793v1">http://arxiv.org/abs/2512.16793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Reasoning Meets Its Laws</title>
      <itunes:episode>1511</itunes:episode>
      <podcast:episode>1511</podcast:episode>
      <itunes:title>When Reasoning Meets Its Laws</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27d88299-14ce-43b2-a432-39cc8b32a933</guid>
      <link>https://share.transistor.fm/s/df8c796d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            When Reasoning Meets Its Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17901v1">http://arxiv.org/abs/2512.17901v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose the compute law, with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses via two properties of the laws: monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/</p>
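
            <p><strong>Illustration:</strong><br>
            One rough reading of the two measured properties of the compute law: monotonicity means reasoning-token counts should not decrease as question complexity grows, and compositionality means a composed question should use roughly the sum of its parts' compute. The sketch below encodes that reading; the token counts and the 20% tolerance are hypothetical illustrations, not LoRe-Bench's actual metric definitions.</p>

            <pre><code># Toy checks of the compute law's monotonicity and compositionality properties
# (an illustrative reading of the abstract, not LoRe-Bench's metrics).

def is_monotone(tokens_by_complexity):
    """tokens_by_complexity: reasoning-token counts ordered by increasing question complexity."""
    return all(later >= earlier for earlier, later in zip(tokens_by_complexity, tokens_by_complexity[1:]))

def is_compositional(part_tokens, composed_tokens, rel_tol=0.2):
    """Composed-question compute should be within rel_tol of the sum of its parts' compute."""
    expected = sum(part_tokens)
    return rel_tol * expected >= abs(composed_tokens - expected)

print(is_monotone([120, 340, 900]))        # True: token use grows with complexity
print(is_compositional([340, 500], 1400))  # False: 1400 is far above the expected ~840</code></pre>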
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            When Reasoning Meets Its Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17901v1">http://arxiv.org/abs/2512.17901v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose the compute law, with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses via two properties of the laws: monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:49:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/df8c796d/68b7237d.mp3" length="20936401" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1305</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            When Reasoning Meets Its Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17901v1">http://arxiv.org/abs/2512.17901v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose the compute law, with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses via two properties of the laws: monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience</title>
      <itunes:episode>1510</itunes:episode>
      <podcast:episode>1510</podcast:episode>
      <itunes:title>Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c88d380-c559-4e54-bb86-18dee39dc0ac</guid>
      <link>https://share.transistor.fm/s/3337a6de</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17260v1">http://arxiv.org/abs/2512.17260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present <strong>Seed-Prover 1.5</strong>, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves <strong>88% of PutnamBench</strong> (undergraduate-level), <strong>80% of Fate-H</strong> (graduate-level), and <strong>33% of Fate-X</strong> (PhD-level) problems. Notably, using our system, we solved <strong>11 out of 12 problems</strong> from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17260v1">http://arxiv.org/abs/2512.17260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present <strong>Seed-Prover 1.5</strong>, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves <strong>88% of PutnamBench</strong> (undergraduate-level), <strong>80% of Fate-H</strong> (graduate-level), and <strong>33% of Fate-X</strong> (PhD-level) problems. Notably, using our system, we solved <strong>11 out of 12 problems</strong> from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:49:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3337a6de/c21323b4.mp3" length="24610743" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1534</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Wenxiang Chen, Jiacheng Du, Jinyi Hu, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Wenlei Shi, Zhihong Wang, Mingxuan Wang, Chenrui Wei, Shufa Wei, Huajian Xin, Fan Yang, Weihao Gao, Zheng Yuan, Tianyang Zhan, Zeyu Zheng, Tianxi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17260v1">http://arxiv.org/abs/2512.17260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently made significant progress in generating rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present <strong>Seed-Prover 1.5</strong>, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves <strong>88% of PutnamBench</strong> (undergraduate-level), <strong>80% of Fate-H</strong> (graduate-level), and <strong>33% of Fate-X</strong> (PhD-level) problems. Notably, using our system, we solved <strong>11 out of 12 problems</strong> from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation</title>
      <itunes:episode>1509</itunes:episode>
      <podcast:episode>1509</podcast:episode>
      <itunes:title>4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8729e230-3ba4-4648-a8de-6f24b24b452a</guid>
      <link>https://share.transistor.fm/s/26766bf6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen</p>

            <p><strong>Title:</strong><br>
            4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17012v1">http://arxiv.org/abs/2512.17012v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen</p>

            <p><strong>Title:</strong><br>
            4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17012v1">http://arxiv.org/abs/2512.17012v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:49:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26766bf6/8be13862.mp3" length="25503485" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1590</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen</p>

            <p><strong>Title:</strong><br>
            4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17012v1">http://arxiv.org/abs/2512.17012v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing</title>
      <itunes:episode>1508</itunes:episode>
      <podcast:episode>1508</podcast:episode>
      <itunes:title>Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9e90bf30-1305-4063-bdd1-8d6470608f10</guid>
      <link>https://share.transistor.fm/s/4181f979</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17909v1">http://arxiv.org/abs/2512.17909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.</p>
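
            <p><strong>Illustration:</strong><br>
            A quick back-of-the-envelope on the latent size quoted above (96 channels at 16x16 spatial downsampling), assuming a hypothetical 512x512 RGB input: the latent holds 8 times fewer scalars than the raw pixels.</p>

            <pre><code># Size of a 96-channel latent at 16x16 spatial downsampling, for a hypothetical 512x512 RGB input.
h = w = 512
pixel_scalars = h * w * 3
latent_scalars = (h // 16) * (w // 16) * 96
print(pixel_scalars, latent_scalars, pixel_scalars / latent_scalars)  # 786432 98304 8.0</code></pre>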
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17909v1">http://arxiv.org/abs/2512.17909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:48:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4181f979/20188609.mp3" length="22784289" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.17909v1">http://arxiv.org/abs/2512.17909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are We on the Right Way to Assessing LLM-as-a-Judge?</title>
      <itunes:episode>1507</itunes:episode>
      <podcast:episode>1507</podcast:episode>
      <itunes:title>Are We on the Right Way to Assessing LLM-as-a-Judge?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a237b20-ef18-4663-b25c-703ccc6a95f5</guid>
      <link>https://share.transistor.fm/s/6aba08e5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            Are We on the Right Way to Assessing LLM-as-a-Judge?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16041v1">http://arxiv.org/abs/2512.16041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a supervised reward signal in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that a finetuned LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.</p>
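
            <p><strong>Illustration:</strong><br>
            The "global logical consistency" lens above is transitivity over a judge's pairwise preferences: preferring A over B and B over C should imply preferring A over C. The sketch below counts violating triples in a toy preference table; the data layout and example are illustrative only, not Sage's actual scoring.</p>

            <pre><code># Count transitivity violations in a set of pairwise judge preferences
# (a generic illustration of the idea, not Sage's exact metric).
from itertools import permutations

def transitivity_violations(prefs):
    """prefs maps ordered answer pairs (a, b) to True if the judge prefers a over b."""
    items = sorted({x for pair in prefs for x in pair})
    violations = 0
    for a, b, c in permutations(items, 3):
        if prefs.get((a, b)) and prefs.get((b, c)) and not prefs.get((a, c), False):
            violations += 1
    return violations

# Hypothetical judge output over three answers forming a cycle: A over B, B over C, C over A.
judge_prefs = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True,
               ("B", "A"): False, ("C", "B"): False, ("A", "C"): False}
print(transitivity_violations(judge_prefs))  # 3 violating ordered triples, so the judge is inconsistent</code></pre>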
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            Are We on the Right Way to Assessing LLM-as-a-Judge?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16041v1">http://arxiv.org/abs/2512.16041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a supervised reward signal in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that a finetuned LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Dec 2025 19:48:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6aba08e5/525f6ddd.mp3" length="22396356" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            Are We on the Right Way to Assessing LLM-as-a-Judge?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16041v1">http://arxiv.org/abs/2512.16041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge mainly rely on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks such as LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that a finetuned LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kling-Omni Technical Report</title>
      <itunes:episode>1506</itunes:episode>
      <podcast:episode>1506</podcast:episode>
      <itunes:title>Kling-Omni Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9834e67a-f829-4af4-9311-33a6cce87911</guid>
      <link>https://share.transistor.fm/s/d9617a83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu</p>

            <p><strong>Title:</strong><br>
            Kling-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16776v1">http://arxiv.org/abs/2512.16776v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, Kling-Omni is, we believe, a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu</p>

            <p><strong>Title:</strong><br>
            Kling-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16776v1">http://arxiv.org/abs/2512.16776v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, Kling-Omni is, we believe, a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:49:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d9617a83/b51ef68d.mp3" length="23378535" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1457</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, Xiao Hu, Xiaohua Hu, Boyuan Jiang, Fangyuan Kong, Hang Li, Jie Li, Qingyu Li, Shen Li, Xiaohan Li, Yan Li, Jiajun Liang, Borui Liao, Yiqiao Liao, Weihong Lin, Quande Liu, Xiaokun Liu, Yilun Liu, Yuliang Liu, Shun Lu, Hangyu Mao, Yunyao Mao, Haodong Ouyang, Wenyu Qin, Wanqi Shi, Xiaoyu Shi, Lianghao Su, Haozhi Sun, Peiqin Sun, Pengfei Wan, Chao Wang, Chenyu Wang, Meng Wang, Qiulin Wang, Runqi Wang, Xintao Wang, Xuebo Wang, Zekun Wang, Min Wei, Tiancheng Wen, Guohao Wu, Xiaoshi Wu, Zhenhua Wu, Da Xie, Yingtong Xiong, Yulong Xu, Sile Yang, Zikang Yang, Weicai Ye, Ziyang Yuan, Shenglong Zhang, Shuaiyu Zhang, Yuanxing Zhang, Yufan Zhang, Wenzheng Zhao, Ruiliang Zhou, Yan Zhou, Guosheng Zhu, Yongjie Zhu</p>

            <p><strong>Title:</strong><br>
            Kling-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16776v1">http://arxiv.org/abs/2512.16776v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, Kling-Omni is, we believe, a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Adaptation of Agentic AI</title>
      <itunes:episode>1505</itunes:episode>
      <podcast:episode>1505</podcast:episode>
      <itunes:title>Adaptation of Agentic AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7f70fa35-cdec-4952-98ee-eb3f23631114</guid>
      <link>https://share.transistor.fm/s/46007d9d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han</p>

            <p><strong>Title:</strong><br>
            Adaptation of Agentic AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16301v1">http://arxiv.org/abs/2512.16301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han</p>

            <p><strong>Title:</strong><br>
            Adaptation of Agentic AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16301v1">http://arxiv.org/abs/2512.16301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:49:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46007d9d/3bfd2bc2.mp3" length="25338761" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1580</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He, Yichen Wu, Ming Zhong, Peiyang Song, Qizheng Zhang, Heng Wang, Xueqiang Xu, Hanwen Xu, Pengrui Han, Dylan Zhang, Jiashuo Sun, Chaoqi Yang, Kun Qian, Tian Wang, Changran Hu, Manling Li, Quanzheng Li, Hao Peng, Sheng Wang, Jingbo Shang, Chao Zhang, Jiaxuan You, Liyuan Liu, Pan Lu, Yu Zhang, Heng Ji, Yejin Choi, Dawn Song, Jimeng Sun, Jiawei Han</p>

            <p><strong>Title:</strong><br>
            Adaptation of Agentic AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16301v1">http://arxiv.org/abs/2512.16301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaDA2.0: Scaling Up Diffusion Language Models to 100B</title>
      <itunes:episode>1504</itunes:episode>
      <podcast:episode>1504</podcast:episode>
      <itunes:title>LLaDA2.0: Scaling Up Diffusion Language Models to 100B</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dd191d97-dde7-414c-b7b9-6c12aeeb4662</guid>
      <link>https://share.transistor.fm/s/71647452</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang</p>

            <p><strong>Title:</strong><br>
            LLaDA2.0: Scaling Up Diffusion Language Models to 100B</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15745v1">http://arxiv.org/abs/2512.15745v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang</p>

            <p><strong>Title:</strong><br>
            LLaDA2.0: Scaling Up Diffusion Language Models to 100B</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15745v1">http://arxiv.org/abs/2512.15745v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:48:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/71647452/d0aae0d6.mp3" length="25525200" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1592</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang</p>

            <p><strong>Title:</strong><br>
            LLaDA2.0: Scaling Up Diffusion Language Models to 100B</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15745v1">http://arxiv.org/abs/2512.15745v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Next-Embedding Prediction Makes Strong Vision Learners</title>
      <itunes:episode>1503</itunes:episode>
      <podcast:episode>1503</podcast:episode>
      <itunes:title>Next-Embedding Prediction Makes Strong Vision Learners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d7619e70-6735-483e-9099-f30981e587ae</guid>
      <link>https://share.transistor.fm/s/cf46095a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu</p>

            <p><strong>Title:</strong><br>
            Next-Embedding Prediction Makes Strong Vision Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16922v1">http://arxiv.org/abs/2512.16922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective, requiring no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu</p>

            <p><strong>Title:</strong><br>
            Next-Embedding Prediction Makes Strong Vision Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16922v1">http://arxiv.org/abs/2512.16922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective, requiring no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:48:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cf46095a/537928ca.mp3" length="21178843" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1320</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu</p>

            <p><strong>Title:</strong><br>
            Next-Embedding Prediction Makes Strong Vision Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16922v1">http://arxiv.org/abs/2512.16922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1K with next-embedding prediction as its sole learning objective is effective, requiring no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors</title>
      <itunes:episode>1502</itunes:episode>
      <podcast:episode>1502</podcast:episode>
      <itunes:title>StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e52806c8-c498-424f-8779-7d902ddc4502</guid>
      <link>https://share.transistor.fm/s/833ca7bd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16915v1">http://arxiv.org/abs/2512.16915v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16915v1">http://arxiv.org/abs/2512.16915v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:47:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/833ca7bd/74de6470.mp3" length="23162924" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1444</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guibao Shen, Yihua Du, Wenhang Ge, Jing He, Chirui Chang, Donghao Zhou, Zhen Yang, Luozhou Wang, Xin Tao, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16915v1">http://arxiv.org/abs/2512.16915v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage "Depth-Warp-Inpaint" (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: https://hit-perfect.github.io/StereoPilot/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model</title>
      <itunes:episode>1501</itunes:episode>
      <podcast:episode>1501</podcast:episode>
      <itunes:title>Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2448217e-fdc0-4225-a3f8-8bde0a4421bd</guid>
      <link>https://share.transistor.fm/s/e6ca4bae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13507v2">http://arxiv.org/abs/2512.13507v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13507v2">http://arxiv.org/abs/2512.13507v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:47:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e6ca4bae/17696531.mp3" length="21406650" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13507v2">http://arxiv.org/abs/2512.13507v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation</title>
      <itunes:episode>1500</itunes:episode>
      <podcast:episode>1500</podcast:episode>
      <itunes:title>Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">959bb7b6-53f0-48ff-8ad9-edbfad412ba6</guid>
      <link>https://share.transistor.fm/s/649dcf8a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi</p>

            <p><strong>Title:</strong><br>
            Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16913v1">http://arxiv.org/abs/2512.16913v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi</p>

            <p><strong>Title:</strong><br>
            Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16913v1">http://arxiv.org/abs/2512.16913v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:47:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/649dcf8a/a8fc6a3d.mp3" length="20679398" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi</p>

            <p><strong>Title:</strong><br>
            Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16913v1">http://arxiv.org/abs/2512.16913v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generative Refocusing: Flexible Defocus Control from a Single Image</title>
      <itunes:episode>1499</itunes:episode>
      <podcast:episode>1499</podcast:episode>
      <itunes:title>Generative Refocusing: Flexible Defocus Control from a Single Image</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e8772cd2-5dbb-4ae7-8ddf-4682ca405164</guid>
      <link>https://share.transistor.fm/s/130ea219</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Generative Refocusing: Flexible Defocus Control from a Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16923v1">http://arxiv.org/abs/2512.16923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Generative Refocusing: Flexible Defocus Control from a Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16923v1">http://arxiv.org/abs/2512.16923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:46:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/130ea219/2d383134.mp3" length="24483659" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1527</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chun-Wei Tuan Mu, Jia-Bin Huang, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Generative Refocusing: Flexible Defocus Control from a Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16923v1">http://arxiv.org/abs/2512.16923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeContext as Defense: Safe Image Editing in Diffusion Transformers</title>
      <itunes:episode>1498</itunes:episode>
      <podcast:episode>1498</podcast:episode>
      <itunes:title>DeContext as Defense: Safe Image Editing in Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">84922603-cfcd-4fd9-9770-732073a2d05d</guid>
      <link>https://share.transistor.fm/s/b8a1412c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Linghui Shen, Mingyue Cui, Xingyi Yang</p>

            <p><strong>Title:</strong><br>
            DeContext as Defense: Safe Image Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16625v1">http://arxiv.org/abs/2512.16625v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Linghui Shen, Mingyue Cui, Xingyi Yang</p>

            <p><strong>Title:</strong><br>
            DeContext as Defense: Safe Image Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16625v1">http://arxiv.org/abs/2512.16625v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Dec 2025 19:46:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8a1412c/64293810.mp3" length="22680164" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Linghui Shen, Mingyue Cui, Xingyi Yang</p>

            <p><strong>Title:</strong><br>
            DeContext as Defense: Safe Image Editing in Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.16625v1">http://arxiv.org/abs/2512.16625v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decoupling the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step-GUI Technical Report</title>
      <itunes:episode>1497</itunes:episode>
      <podcast:episode>1497</podcast:episode>
      <itunes:title>Step-GUI Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f1b502e6-711a-4bdc-90b3-92e034d88190</guid>
      <link>https://share.transistor.fm/s/c7bd352b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-GUI Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15431v1">http://arxiv.org/abs/2512.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving &gt;90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with a hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-GUI Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15431v1">http://arxiv.org/abs/2512.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving &gt;90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with a hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Dec 2025 19:21:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c7bd352b/51c5293e.mp3" length="25351300" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1581</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-GUI Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15431v1">http://arxiv.org/abs/2512.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving &gt;90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with a hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DEER: Draft with Diffusion, Verify with Autoregressive Models</title>
      <itunes:episode>1496</itunes:episode>
      <podcast:episode>1496</podcast:episode>
      <itunes:title>DEER: Draft with Diffusion, Verify with Autoregressive Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a92c1cfc-d4c6-4f69-9c99-3e885bdadc64</guid>
      <link>https://share.transistor.fm/s/10dc9307</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu</p>

            <p><strong>Title:</strong><br>
            DEER: Draft with Diffusion, Verify with Autoregressive Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15176v1">http://arxiv.org/abs/2512.15176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafter can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc., will be available at https://czc726.github.io/DEER/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu</p>

            <p><strong>Title:</strong><br>
            DEER: Draft with Diffusion, Verify with Autoregressive Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15176v1">http://arxiv.org/abs/2512.15176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafter can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc., will be available at https://czc726.github.io/DEER/</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Dec 2025 19:21:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/10dc9307/030f9617.mp3" length="24762014" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1544</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu</p>

            <p><strong>Title:</strong><br>
            DEER: Draft with Diffusion, Verify with Autoregressive Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.15176v1">http://arxiv.org/abs/2512.15176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a. drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafter can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc., will be available at https://czc726.github.io/DEER/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</title>
      <itunes:episode>1495</itunes:episode>
      <podcast:episode>1495</podcast:episode>
      <itunes:title>Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f79ed90e-8f55-4452-920d-75ad60107f7f</guid>
      <link>https://share.transistor.fm/s/9891e0f8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14681v1">http://arxiv.org/abs/2512.14681v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14681v1">http://arxiv.org/abs/2512.14681v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Dec 2025 19:20:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9891e0f8/bd4fe9cb.mp3" length="20978649" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14681v1">http://arxiv.org/abs/2512.14681v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices</title>
      <itunes:episode>1494</itunes:episode>
      <podcast:episode>1494</podcast:episode>
      <itunes:title>HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4d9e85df-9f05-4a8d-8818-0165c1425e2f</guid>
      <link>https://share.transistor.fm/s/92a686dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang</p>

            <p><strong>Title:</strong><br>
            HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14052v1">http://arxiv.org/abs/2512.14052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang</p>

            <p><strong>Title:</strong><br>
            HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14052v1">http://arxiv.org/abs/2512.14052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Dec 2025 19:20:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92a686dc/d4005a60.mp3" length="21292974" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang</p>

            <p><strong>Title:</strong><br>
            HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14052v1">http://arxiv.org/abs/2512.14052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models possess strong perceptual and reasoning capabilities; however, high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Puzzle Curriculum GRPO for Vision-Centric Reasoning</title>
      <itunes:episode>1493</itunes:episode>
      <podcast:episode>1493</podcast:episode>
      <itunes:title>Puzzle Curriculum GRPO for Vision-Centric Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">30522e97-6c9f-4efc-aa03-c81bc1fbf30e</guid>
      <link>https://share.transistor.fm/s/2bee0bfb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk</p>

            <p><strong>Title:</strong><br>
            Puzzle Curriculum GRPO for Vision-Centric Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14944v1">http://arxiv.org/abs/2512.14944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk</p>

            <p><strong>Title:</strong><br>
            Puzzle Curriculum GRPO for Vision-Centric Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14944v1">http://arxiv.org/abs/2512.14944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Dec 2025 19:20:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2bee0bfb/4f0f7273.mp3" length="24641214" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1536</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk</p>

            <p><strong>Title:</strong><br>
            Puzzle Curriculum GRPO for Vision-Centric Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14944v1">http://arxiv.org/abs/2512.14944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMGR: Multi-Modal Generative Reasoning</title>
      <itunes:episode>1492</itunes:episode>
      <podcast:episode>1492</podcast:episode>
      <itunes:title>MMGR: Multi-Modal Generative Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b4c4e3c7-7dfc-471b-b545-10a6b07c7139</guid>
      <link>https://share.transistor.fm/s/d7870b33</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu</p>

            <p><strong>Title:</strong><br>
            MMGR: Multi-Modal Generative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14691v2">http://arxiv.org/abs/2512.14691v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu</p>

            <p><strong>Title:</strong><br>
            MMGR: Multi-Modal Generative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14691v2">http://arxiv.org/abs/2512.14691v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:38:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d7870b33/ee599dfb.mp3" length="23661922" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1475</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu</p>

            <p><strong>Title:</strong><br>
            MMGR: Multi-Modal Generative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14691v2">http://arxiv.org/abs/2512.14691v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?</title>
      <itunes:episode>1491</itunes:episode>
      <podcast:episode>1491</podcast:episode>
      <itunes:title>Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">877c1b0b-2d1f-4b24-99e9-95b06f1f0d2e</guid>
      <link>https://share.transistor.fm/s/be517d42</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin</p>

            <p><strong>Title:</strong><br>
            Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13281v2">http://arxiv.org/abs/2512.13281v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: <strong>(i) Immersive ASMR video-audio sources.</strong> Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. <strong>(ii) Peer-Review evaluation.</strong> An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random chance is 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin</p>

            <p><strong>Title:</strong><br>
            Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13281v2">http://arxiv.org/abs/2512.13281v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: <strong>(i) Immersive ASMR video-audio sources.</strong> Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. <strong>(ii) Peer-Review evaluation.</strong> An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random chance is 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:38:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/be517d42/d5ee432b.mp3" length="23412433" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin</p>

            <p><strong>Title:</strong><br>
            Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13281v2">http://arxiv.org/abs/2512.13281v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have produced vivid content that is often indistinguishable from real videos, making AI-generated video detection an emerging societal challenge. Prior AIGC detection benchmarks mostly evaluate video without audio, target broad narrative domains, and focus solely on classification. Yet it remains unclear whether state-of-the-art video generation models can produce immersive, audio-paired videos that reliably deceive humans and VLMs. To this end, we introduce Video Reality Test, an ASMR-sourced video benchmark suite for testing perceptual realism under tight audio-visual coupling, featuring the following dimensions: <strong>(i) Immersive ASMR video-audio sources.</strong> Built on carefully curated real ASMR videos, the benchmark targets fine-grained action-object interactions with diversity across objects, actions, and backgrounds. <strong>(ii) Peer-Review evaluation.</strong> An adversarial creator-reviewer protocol where video generation models act as creators aiming to fool reviewers, while VLMs serve as reviewers seeking to identify fakeness. Our experimental findings show that the best creator, Veo3.1-Fast, fools most VLMs: the strongest reviewer (Gemini 2.5-Pro) achieves only 56% accuracy (random chance is 50%), far below that of human experts (81.25%). Adding audio improves real-fake discrimination, yet superficial cues such as watermarks can still significantly mislead models. These findings delineate the current boundary of video generation realism and expose limitations of VLMs in perceptual fidelity and audio-visual consistency. Our code is available at https://github.com/video-reality-test/video-reality-test.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling</title>
      <itunes:episode>1490</itunes:episode>
      <podcast:episode>1490</podcast:episode>
      <itunes:title>WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2d91eb17-0c1f-4f6d-b008-847b4a4511e0</guid>
      <link>https://share.transistor.fm/s/4026a0a1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14614v1">http://arxiv.org/abs/2512.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14614v1">http://arxiv.org/abs/2512.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.</p>
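
            <p><strong>Illustrative sketch (not from the paper):</strong> a toy Python rendering of the "reconstituted context memory" idea described above: keep the most recent frames plus a few geometrically important long-past frames, then reindex ("reframe") them so they remain accessible as nearby context. The importance scores, window sizes, and function names are hypothetical placeholders, not WorldPlay's actual components.</p>

            <pre><code>from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    importance: float   # stand-in for "how much new geometry this frame revealed"

def reconstitute_context(history, recent_window=8, num_keyframes=4):
    """Toy memory reconstitution: recent frames + top-importance older frames."""
    recent = history[-recent_window:]
    older = history[:-recent_window]
    keyframes = sorted(older, key=lambda f: f.importance, reverse=True)[:num_keyframes]
    context = sorted(keyframes + recent, key=lambda f: f.index)
    # "Temporal reframing": assign fresh contiguous positions so long-past but
    # important frames stay inside the model's effective context window.
    return list(enumerate(context))

if __name__ == "__main__":
    hist = [Frame(i, importance=(1.0 if i % 10 == 0 else 0.1)) for i in range(40)]
    for pos, frame in reconstitute_context(hist):
        print(pos, frame.index, frame.importance)
</code></pre>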
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:37:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4026a0a1/71f8f623.mp3" length="20724558" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14614v1">http://arxiv.org/abs/2512.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling</title>
      <itunes:episode>1489</itunes:episode>
      <podcast:episode>1489</podcast:episode>
      <itunes:title>Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e318f06-ce94-4f9c-a0a1-b66e648ac85a</guid>
      <link>https://share.transistor.fm/s/46eae7c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12675v1">http://arxiv.org/abs/2512.12675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12675v1">http://arxiv.org/abs/2512.12675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:37:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46eae7c6/e1e0da65.mp3" length="21682136" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12675v1">http://arxiv.org/abs/2512.12675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics</title>
      <itunes:episode>1488</itunes:episode>
      <podcast:episode>1488</podcast:episode>
      <itunes:title>RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6dfbc6ae-367b-4798-89a8-822c0ff51b9b</guid>
      <link>https://share.transistor.fm/s/aa75d88e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13660v1">http://arxiv.org/abs/2512.13660v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13660v1">http://arxiv.org/abs/2512.13660v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:36:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa75d88e/b77cd7a0.mp3" length="19165986" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1194</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13660v1">http://arxiv.org/abs/2512.13660v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value</title>
      <itunes:episode>1487</itunes:episode>
      <podcast:episode>1487</podcast:episode>
      <itunes:title>OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d6ceadfe-6541-45b6-a375-ba5bc48288c0</guid>
      <link>https://share.transistor.fm/s/f60b7fda</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14051v1">http://arxiv.org/abs/2512.14051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14051v1">http://arxiv.org/abs/2512.14051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Dec 2025 19:36:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f60b7fda/474fa353.mp3" length="28306746" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1765</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhang Cai, Xin Gao, Yu Li, Honglin Lin, Zheng Liu, Zhuoshi Pan, Qizhi Pei, Xiaoran Shang, Mengyuan Sun, Zinan Tang, Xiaoyang Wang, Zhanping Zhong, Yun Zhu, Dahua Lin, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.14051v1">http://arxiv.org/abs/2512.14051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box--characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA--covering over 120 training datasets across multiple domains on 22 benchmarks, validated by more than 600 training runs and 40 million processed data points--reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding</title>
      <itunes:episode>1486</itunes:episode>
      <podcast:episode>1486</podcast:episode>
      <itunes:title>ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">04a3bd20-f75e-4275-8711-0fac215e4ce4</guid>
      <link>https://share.transistor.fm/s/54f6f683</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13586v1">http://arxiv.org/abs/2512.13586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13586v1">http://arxiv.org/abs/2512.13586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.</p>
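
            <p><strong>Illustrative sketch (not from the paper):</strong> a minimal, hypothetical Python loop showing the shape of slot-level "plan-and-infill" decoding as the abstract describes it: a planning step picks weakly dependent masked slots, and an infilling step fills each chosen slot. The toy planner, infiller, and slot sizes below are placeholders rather than ReFusion's trained components.</p>

            <pre><code>import random

SLOT_LEN = 4     # each slot is a fixed-length, contiguous sub-sequence
NUM_SLOTS = 6
MASK = "[MASK]"

def plan_slots(sequence, k=2):
    """Toy planning step: choose up to k fully masked slots for this round.
    A real planner would use the diffusion model to find weakly dependent slots."""
    masked = [i for i in range(NUM_SLOTS)
              if all(tok == MASK for tok in sequence[i * SLOT_LEN:(i + 1) * SLOT_LEN])]
    return random.sample(masked, min(k, len(masked)))

def infill_slot(slot_idx):
    """Toy autoregressive infilling: emit the slot's tokens left to right.
    A real infiller would condition on all visible context via the KV cache."""
    start = slot_idx * SLOT_LEN
    return ["tok%d" % (start + j) for j in range(SLOT_LEN)]

def decode():
    sequence = [MASK] * (SLOT_LEN * NUM_SLOTS)
    while MASK in sequence:
        for idx in plan_slots(sequence):    # slots chosen in one round are weakly
            filled = infill_slot(idx)       # dependent, so a real model fills them
            sequence[idx * SLOT_LEN:(idx + 1) * SLOT_LEN] = filled  # in parallel
    return sequence

if __name__ == "__main__":
    print(decode())
</code></pre>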
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 20:00:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/54f6f683/cf389ce9.mp3" length="25499731" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1590</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13586v1">http://arxiv.org/abs/2512.13586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards Scalable Pre-training of Visual Tokenizers for Generation</title>
      <itunes:episode>1485</itunes:episode>
      <podcast:episode>1485</podcast:episode>
      <itunes:title>Towards Scalable Pre-training of Visual Tokenizers for Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b18d79dc-1f54-4ea9-993c-04d7daecd681</guid>
      <link>https://share.transistor.fm/s/3e9437fd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Towards Scalable Pre-training of Visual Tokenizers for Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13687v1">http://arxiv.org/abs/2512.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pre-training of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early at 1/10 of the FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Towards Scalable Pre-training of Visual Tokenizers for Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13687v1">http://arxiv.org/abs/2512.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pre-training of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early at 1/10 of the FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.</p>
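
            <p><strong>Illustrative sketch (not from the paper):</strong> a toy NumPy example of the joint objective the abstract names (image-text contrastive + self-supervised + reconstruction) combined as a weighted sum. The loss forms, weights, and function names are illustrative assumptions, not VTP's actual implementation.</p>

            <pre><code>import numpy as np

rng = np.random.default_rng(0)

def info_nce(img_emb, txt_emb, temperature=0.07):
    """One-directional InfoNCE over a batch of paired image/text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(img))
    return -log_probs[idx, idx].mean()   # matched pairs sit on the diagonal

def total_loss(img_emb, txt_emb, student_emb, teacher_emb, recon, target,
               w_con=1.0, w_ssl=1.0, w_rec=1.0):
    contrastive = info_nce(img_emb, txt_emb)          # image-text alignment
    ssl = np.mean((student_emb - teacher_emb) ** 2)   # e.g. a self-distillation term
    reconstruction = np.mean((recon - target) ** 2)   # pixel-level VAE-style term
    return w_con * contrastive + w_ssl * ssl + w_rec * reconstruction

if __name__ == "__main__":
    B, D, P = 8, 16, 32
    print(total_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                     rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                     rng.normal(size=(B, P)), rng.normal(size=(B, P))))
</code></pre>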
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 20:00:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e9437fd/5fb885ab.mp3" length="21095680" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1315</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Yuda Song, Yucong Zhou, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Towards Scalable Pre-training of Visual Tokenizers for Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13687v1">http://arxiv.org/abs/2512.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quality of the latent space in visual tokenizers (e.g., VAEs) is crucial for modern generative models. However, the standard reconstruction-based training paradigm produces a latent space that is biased towards low-level information, leading to a foundational flaw: better pixel-level accuracy does not lead to higher-quality generation. This implies that pouring extensive compute into visual tokenizer pre-training translates poorly to improved performance in generation. We identify this as the "pre-training scaling problem" and suggest a necessary shift: to be effective for generation, a latent space must concisely represent high-level semantics. We present VTP, a unified visual tokenizer pre-training framework, pioneering the joint optimization of image-text contrastive, self-supervised, and reconstruction losses. Our large-scale study reveals two principal findings: (1) understanding is a key driver of generation, and (2) much better scaling properties, where generative performance scales effectively with compute, parameters, and data allocated to the pre-training of the visual tokenizer. After large-scale pre-training, our tokenizer delivers a competitive profile (78.2 zero-shot accuracy and 0.36 rFID on ImageNet) and 4.1 times faster convergence on generation compared to advanced distillation methods. More importantly, it scales effectively: without modifying standard DiT training specs, solely investing more FLOPS in pre-training VTP achieves a 65.8% FID improvement in downstream generation, while a conventional autoencoder stagnates very early at 1/10 of the FLOPS. Our pre-trained models are available at https://github.com/MiniMax-AI/VTP.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Memory in the Age of AI Agents</title>
      <itunes:episode>1484</itunes:episode>
      <podcast:episode>1484</podcast:episode>
      <itunes:title>Memory in the Age of AI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c33a5dcb-c3a7-4def-a6c6-868ac5ab286c</guid>
      <link>https://share.transistor.fm/s/d7a53c18</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            Memory in the Age of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13564v1">http://arxiv.org/abs/2512.13564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory has emerged as, and will remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            Memory in the Age of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13564v1">http://arxiv.org/abs/2512.13564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory has emerged as, and will continue to be, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:59:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d7a53c18/7c048034.mp3" length="23012406" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1435</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan</p>

            <p><strong>Title:</strong><br>
            Memory in the Age of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13564v1">http://arxiv.org/abs/2512.13564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory has emerged as, and will continue to be, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management</title>
      <itunes:episode>1483</itunes:episode>
      <podcast:episode>1483</podcast:episode>
      <itunes:title>QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f85fc95f-6447-42d4-b94d-58ae10aae53e</guid>
      <link>https://share.transistor.fm/s/ce56a647</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12967v1">http://arxiv.org/abs/2512.12967v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12967v1">http://arxiv.org/abs/2512.12967v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:59:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ce56a647/c4444566.mp3" length="23418716" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12967v1">http://arxiv.org/abs/2512.12967v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO) that dynamically regulates exploration-exploitation trade-offs. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M~4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains like scientific reasoning, memory tool use, and extended dialogue.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongVie 2: Multimodal Controllable Ultra-Long Video World Model</title>
      <itunes:episode>1482</itunes:episode>
      <podcast:episode>1482</podcast:episode>
      <itunes:title>LongVie 2: Multimodal Controllable Ultra-Long Video World Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3d558d1-a3cf-4cc3-bad7-766e25aadfad</guid>
      <link>https://share.transistor.fm/s/7adc686e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            LongVie 2: Multimodal Controllable Ultra-Long Video World Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13604v1">http://arxiv.org/abs/2512.13604v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            LongVie 2: Multimodal Controllable Ultra-Long Video World Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13604v1">http://arxiv.org/abs/2512.13604v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:59:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7adc686e/ab75d4f4.mp3" length="25396478" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1584</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            LongVie 2: Multimodal Controllable Ultra-Long Video World Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13604v1">http://arxiv.org/abs/2512.13604v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Finch: Benchmarking Finance &amp; Accounting across Spreadsheet-Centric Enterprise Workflows</title>
      <itunes:episode>1481</itunes:episode>
      <podcast:episode>1481</podcast:episode>
      <itunes:title>Finch: Benchmarking Finance &amp; Accounting across Spreadsheet-Centric Enterprise Workflows</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9567c97d-f7d1-48c7-ba4a-ce9f8ec512f8</guid>
      <link>https://share.transistor.fm/s/758cc09e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CE, cs.IR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng</p>

            <p><strong>Title:</strong><br>
            Finch: Benchmarking Finance &amp; Accounting across Spreadsheet-Centric Enterprise Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13168v1">http://arxiv.org/abs/2512.13168v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a finance &amp; accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows: interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CE, cs.IR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng</p>

            <p><strong>Title:</strong><br>
            Finch: Benchmarking Finance &amp; Accounting across Spreadsheet-Centric Enterprise Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13168v1">http://arxiv.org/abs/2512.13168v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a finance &amp; accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows: interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:58:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/758cc09e/191a10ea.mp3" length="29703572" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1853</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CE, cs.IR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Yan Gao, Xuanyu Dong, Yilin Cheng, Mingzhe Lu, Adina Yakefu, Shuxin Zheng</p>

            <p><strong>Title:</strong><br>
            Finch: Benchmarking Finance &amp; Accounting across Spreadsheet-Centric Enterprise Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13168v1">http://arxiv.org/abs/2512.13168v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a finance &amp; accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows: interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents</title>
      <itunes:episode>1480</itunes:episode>
      <podcast:episode>1480</podcast:episode>
      <itunes:title>NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">af4e9333-bd89-4fb6-bd7a-5b55a11dbdb3</guid>
      <link>https://share.transistor.fm/s/253e465e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12730v1">http://arxiv.org/abs/2512.12730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12730v1">http://arxiv.org/abs/2512.12730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:58:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/253e465e/046e0df1.mp3" length="23936151" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1492</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12730v1">http://arxiv.org/abs/2512.12730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics</title>
      <itunes:episode>1479</itunes:episode>
      <podcast:episode>1479</podcast:episode>
      <itunes:title>Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56d2cb94-153e-49b8-9a15-37aa362b6e42</guid>
      <link>https://share.transistor.fm/s/645daf4c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingdi Lei, Di Zhang, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12602v1">http://arxiv.org/abs/2512.12602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingdi Lei, Di Zhang, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12602v1">http://arxiv.org/abs/2512.12602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.</p>
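
            <p><strong>Illustrative sketch:</strong><br>
            A minimal, hypothetical NumPy sketch of the discrete delta-rule fast-weight recurrence used by DeltaNet-style linear attention, i.e. the update that EFLA reformulates as a continuous-time system. It does not reproduce the paper's exact closed-form solution; the shapes and the beta schedule below are illustrative assumptions.</p>

            <pre><code>import numpy as np

def delta_rule_attention(q, k, v, beta):
    """q, k: (T, d_k); v: (T, d_v); beta: (T,) per-step learning rates in [0, 1]."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # fast-weight state (associative memory)
    outputs = np.zeros((T, d_v))
    for t in range(T):
        pred = S @ k[t]                                  # current prediction for key k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])    # delta-rule (error-correcting) update
        outputs[t] = S @ q[t]                            # read out with the query
    return outputs

# Toy usage with random inputs.
rng = np.random.default_rng(0)
T, d_k, d_v = 16, 8, 8
out = delta_rule_attention(rng.standard_normal((T, d_k)),
                           rng.standard_normal((T, d_k)),
                           rng.standard_normal((T, d_v)),
                           np.full(T, 0.5))
print(out.shape)  # (16, 8)</code></pre>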
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:57:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/645daf4c/80f2407d.mp3" length="21848448" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingdi Lei, Di Zhang, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.12602v1">http://arxiv.org/abs/2512.12602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear-time attention and State Space Models (SSMs) promise to solve the quadratic cost bottleneck in long-context language models employing softmax attention. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to the infinite-order Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving the linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance than DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>KlingAvatar 2.0 Technical Report</title>
      <itunes:episode>1478</itunes:episode>
      <podcast:episode>1478</podcast:episode>
      <itunes:title>KlingAvatar 2.0 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dd137734-9df1-4f51-b36f-f4c9cff92260</guid>
      <link>https://share.transistor.fm/s/471552e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou</p>

            <p><strong>Title:</strong><br>
            KlingAvatar 2.0 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13313v1">http://arxiv.org/abs/2512.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou</p>

            <p><strong>Title:</strong><br>
            KlingAvatar 2.0 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13313v1">http://arxiv.org/abs/2512.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:57:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/471552e4/1c8ba570.mp3" length="23289097" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Yuan Gao, Kang He, Jingyun Hua, Boyuan Jiang, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Yuan Liu, Shun Lu, Yongsen Mao, Yingchao Shao, Huafeng Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Chao Wang, Xuebo Wang, Haoxian Zhang, Yuanxing Zhang, Yan Zhou</p>

            <p><strong>Title:</strong><br>
            KlingAvatar 2.0 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.13313v1">http://arxiv.org/abs/2512.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment</title>
      <itunes:episode>1477</itunes:episode>
      <podcast:episode>1477</podcast:episode>
      <itunes:title>MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe28af04-5ad7-4cc8-8e1b-10f4a112cc93</guid>
      <link>https://share.transistor.fm/s/2c41feb4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09636v2">http://arxiv.org/abs/2512.09636v2</a></p>

            <p><strong>Abstract:</strong><br>
            Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09636v2">http://arxiv.org/abs/2512.09636v2</a></p>

            <p><strong>Abstract:</strong><br>
            Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Dec 2025 19:57:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2c41feb4/6088f4ee.mp3" length="22882899" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1426</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Kailai Yang, Pengde Zhao, Enze Zhang, Ziyan Kuang, Zhiwei Liu, Weiguang Han, Shu Liao, Lianting Huang, Jinpeng Hu, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09636v2">http://arxiv.org/abs/2512.09636v2</a></p>

            <p><strong>Abstract:</strong><br>
            Mental health disorders affect hundreds of millions globally, and the Web now serves as a primary medium for accessing support, information, and assessment. Large language models (LLMs) offer scalable and accessible assistance, yet their deployment in mental-health settings remains risky when their reasoning is incomplete, inconsistent, or ungrounded. Existing psychological LLMs emphasize emotional understanding or knowledge recall but overlook the step-wise, clinically aligned reasoning required for appraisal, diagnosis, intervention planning, abstraction, and verification. To address these issues, we introduce MentraSuite, a unified framework for advancing reliable mental-health reasoning. We propose MentraBench, a comprehensive benchmark spanning five core reasoning aspects, six tasks, and 13 datasets, evaluating both task performance and reasoning quality across five dimensions: conciseness, coherence, hallucination avoidance, task understanding, and internal consistency. We further present Mindora, a post-trained model optimized through a hybrid SFT-RL framework with an inconsistency-detection reward to enforce faithful and coherent reasoning. To support training, we construct high-quality trajectories using a novel reasoning trajectory generation strategy that strategically filters difficult samples and applies a structured, consistency-oriented rewriting process to produce concise, readable, and well-balanced trajectories. Across 20 evaluated LLMs, Mindora achieves the highest average performance on MentraBench and shows remarkable performance in reasoning reliability, demonstrating its effectiveness for complex mental-health scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EgoX: Egocentric Video Generation from a Single Exocentric Video</title>
      <itunes:episode>1476</itunes:episode>
      <podcast:episode>1476</podcast:episode>
      <itunes:title>EgoX: Egocentric Video Generation from a Single Exocentric Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">298e73a7-5862-4cdd-b85d-728f36d559c6</guid>
      <link>https://share.transistor.fm/s/a11ae1f7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            EgoX: Egocentric Video Generation from a Single Exocentric Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08269v1">http://arxiv.org/abs/2512.08269v1</a></p>

            <p><strong>Abstract:</strong><br>
            Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            EgoX: Egocentric Video Generation from a Single Exocentric Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08269v1">http://arxiv.org/abs/2512.08269v1</a></p>

            <p><strong>Abstract:</strong><br>
            Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Dec 2025 19:13:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a11ae1f7/0abd80ca.mp3" length="21078125" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            EgoX: Egocentric Video Generation from a Single Exocentric Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08269v1">http://arxiv.org/abs/2512.08269v1</a></p>

            <p><strong>Abstract:</strong><br>
            Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry</title>
      <itunes:episode>1475</itunes:episode>
      <podcast:episode>1475</podcast:episode>
      <itunes:title>DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edfbfe01-2d21-4c5d-80c3-8445d8e8f5ab</guid>
      <link>https://share.transistor.fm/s/eaa14b16</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11558v1">http://arxiv.org/abs/2512.11558v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11558v1">http://arxiv.org/abs/2512.11558v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Dec 2025 19:12:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eaa14b16/f4903a07.mp3" length="17752425" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1106</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11558v1">http://arxiv.org/abs/2512.11558v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder</title>
      <itunes:episode>1474</itunes:episode>
      <podcast:episode>1474</podcast:episode>
      <itunes:title>SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">22f6ed70-a7c3-4b49-9e05-665a09dda9fd</guid>
      <link>https://share.transistor.fm/s/fc4b7188</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11749v1">http://arxiv.org/abs/2512.11749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11749v1">http://arxiv.org/abs/2512.11749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Dec 2025 19:12:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fc4b7188/5a5d3def.mp3" length="21453894" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11749v1">http://arxiv.org/abs/2512.11749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties</title>
      <itunes:episode>1473</itunes:episode>
      <podcast:episode>1473</podcast:episode>
      <itunes:title>V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">86f8e582-e0a5-4524-be73-6f0d109ce443</guid>
      <link>https://share.transistor.fm/s/1f85be91</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang</p>

            <p><strong>Title:</strong><br>
            V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11799v1">http://arxiv.org/abs/2512.11799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang</p>

            <p><strong>Title:</strong><br>
            V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11799v1">http://arxiv.org/abs/2512.11799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Dec 2025 19:11:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1f85be91/1a3d1de3.mp3" length="22551437" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1406</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang</p>

            <p><strong>Title:</strong><br>
            V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.11799v1">http://arxiv.org/abs/2512.11799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground</title>
      <itunes:episode>1472</itunes:episode>
      <podcast:episode>1472</podcast:episode>
      <itunes:title>T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e616695c-184d-49d5-834f-09e4ba5481e4</guid>
      <link>https://share.transistor.fm/s/a5da7da7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov, Vladislav Kruglikov, Nikita Surkov, German Abramov, Pavel Gein, Dmitry Abulkhanov, Mikhail Gashkov, Viktor Zelenkovskiy, Artem Batalov, Aleksandr Medvedev, Anatolii Potapov</p>

            <p><strong>Title:</strong><br>
            T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10430v1">http://arxiv.org/abs/2512.10430v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov, Vladislav Kruglikov, Nikita Surkov, German Abramov, Pavel Gein, Dmitry Abulkhanov, Mikhail Gashkov, Viktor Zelenkovskiy, Artem Batalov, Aleksandr Medvedev, Anatolii Potapov</p>

            <p><strong>Title:</strong><br>
            T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10430v1">http://arxiv.org/abs/2512.10430v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Dec 2025 19:18:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5da7da7/f9658e69.mp3" length="22018538" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Stoianov, Danil Taranets, Olga Tsymboi, Ramil Latypov, Almaz Dautov, Vladislav Kruglikov, Nikita Surkov, German Abramov, Pavel Gein, Dmitry Abulkhanov, Mikhail Gashkov, Viktor Zelenkovskiy, Artem Batalov, Aleksandr Medvedev, Anatolii Potapov</p>

            <p><strong>Title:</strong><br>
            T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10430v1">http://arxiv.org/abs/2512.10430v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving</title>
      <itunes:episode>1471</itunes:episode>
      <podcast:episode>1471</podcast:episode>
      <itunes:title>Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">60b32dbb-1df4-4a2f-8fd7-e1cd88c7e388</guid>
      <link>https://share.transistor.fm/s/d994f596</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10739v1">http://arxiv.org/abs/2512.10739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight of reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10739v1">http://arxiv.org/abs/2512.10739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight of reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Dec 2025 19:18:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d994f596/6cbf285f.mp3" length="22233794" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1386</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10739v1">http://arxiv.org/abs/2512.10739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight of reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation</title>
      <itunes:episode>1470</itunes:episode>
      <podcast:episode>1470</podcast:episode>
      <itunes:title>Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7add07b-9bf6-443a-ab0e-b8eabf780eaf</guid>
      <link>https://share.transistor.fm/s/2076edb8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10949v1">http://arxiv.org/abs/2512.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL), previously proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, spanning coarse shape generation through texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10949v1">http://arxiv.org/abs/2512.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL), previously proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, spanning coarse shape generation through texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Dec 2025 19:18:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2076edb8/20c4f99d.mp3" length="27785961" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1733</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10949v1">http://arxiv.org/abs/2512.10949v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL), previously proven effective in large language and multi-modal models, has recently been extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, spanning coarse shape generation through texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification</title>
      <itunes:episode>1469</itunes:episode>
      <podcast:episode>1469</podcast:episode>
      <itunes:title>OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">74389e7d-23fb-4972-a33f-319650d5cb28</guid>
      <link>https://share.transistor.fm/s/263e32ef</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10756v1">http://arxiv.org/abs/2512.10756v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight of reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10756v1">http://arxiv.org/abs/2512.10756v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks via Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Dec 2025 19:17:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/263e32ef/d9d4c414.mp3" length="23992156" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1496</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10756v1">http://arxiv.org/abs/2512.10756v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks via Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulty reliably detecting errors in complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive cost of human annotation. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV at lower annotation cost. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning</title>
      <itunes:episode>1468</itunes:episode>
      <podcast:episode>1468</podcast:episode>
      <itunes:title>Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bbc68b25-36f1-4623-8b0a-f5f0c510bd68</guid>
      <link>https://share.transistor.fm/s/0e61ecb7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10534v1">http://arxiv.org/abs/2512.10534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10534v1">http://arxiv.org/abs/2512.10534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Dec 2025 19:17:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0e61ecb7/2a4e4dab.mp3" length="22791381" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.10534v1">http://arxiv.org/abs/2512.10534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) agents exhibit strong mathematical problem-solving abilities and can even solve International Mathematical Olympiad (IMO) level problems with the assistance of formal proof systems. However, due to weak heuristics for auxiliary constructions, AI for geometry problem solving remains dominated by expert models such as AlphaGeometry 2, which rely heavily on large-scale data synthesis and search for both training and evaluation. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the heuristic limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine, and reflecting on the engine's feedback to guide subsequent proposals. A dynamic memory mechanism enables InternGeometry to conduct more than two hundred interactions with the symbolic engine per problem. To further accelerate learning, we introduce Complexity-Boosting Reinforcement Learning (CBRL), which gradually increases the complexity of synthesized problems across training stages. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems (2000-2024), exceeding the average gold medalist score (40.9), using only 13K training examples, just 0.004% of the data used by AlphaGeometry 2, demonstrating the potential of LLM agents on expert-level geometry tasks. InternGeometry can also propose novel auxiliary constructions for IMO problems that do not appear in human solutions. We will release the model, data, and symbolic engine to support future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation</title>
      <itunes:episode>1467</itunes:episode>
      <podcast:episode>1467</podcast:episode>
      <itunes:title>StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f077d790-92a3-46c7-9d61-9ca8f06e34b4</guid>
      <link>https://share.transistor.fm/s/d8597dd5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei</p>

            <p><strong>Title:</strong><br>
            StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09363v2">http://arxiv.org/abs/2512.09363v2</a></p>

            <p><strong>Abstract:</strong><br>
            The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei</p>

            <p><strong>Title:</strong><br>
            StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09363v2">http://arxiv.org/abs/2512.09363v2</a></p>

            <p><strong>Abstract:</strong><br>
            The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Dec 2025 19:12:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8597dd5/c076d08c.mp3" length="21003728" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1309</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei</p>

            <p><strong>Title:</strong><br>
            StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09363v2">http://arxiv.org/abs/2512.09363v2</a></p>

            <p><strong>Abstract:</strong><br>
            The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain</title>
      <itunes:episode>1466</itunes:episode>
      <podcast:episode>1466</podcast:episode>
      <itunes:title>BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2de135c6-1758-4087-a30d-93758f97503e</guid>
      <link>https://share.transistor.fm/s/a6f0acb6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani</p>

            <p><strong>Title:</strong><br>
            BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08560v1">http://arxiv.org/abs/2512.08560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani</p>

            <p><strong>Title:</strong><br>
            BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08560v1">http://arxiv.org/abs/2512.08560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Dec 2025 19:12:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6f0acb6/c999c475.mp3" length="21881890" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani</p>

            <p><strong>Title:</strong><br>
            BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08560v1">http://arxiv.org/abs/2512.08560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniPSD: Layered PSD Generation with Diffusion Transformer</title>
      <itunes:episode>1465</itunes:episode>
      <podcast:episode>1465</podcast:episode>
      <itunes:title>OmniPSD: Layered PSD Generation with Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">da4d209d-1aaf-469f-aac6-ee0bd6f83482</guid>
      <link>https://share.transistor.fm/s/2c8b0b04</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniPSD: Layered PSD Generation with Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09247v1">http://arxiv.org/abs/2512.09247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniPSD: Layered PSD Generation with Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09247v1">http://arxiv.org/abs/2512.09247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Dec 2025 19:11:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2c8b0b04/935e0758.mp3" length="25153221" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1568</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng Liu, Yiren Song, Haofan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniPSD: Layered PSD Generation with Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09247v1">http://arxiv.org/abs/2512.09247v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have greatly improved image generation and editing, yet generating or reconstructing layered PSD files with transparent alpha channels remains highly challenging. We propose OmniPSD, a unified diffusion framework built upon the Flux ecosystem that enables both text-to-PSD generation and image-to-PSD decomposition through in-context learning. For text-to-PSD generation, OmniPSD arranges multiple target layers spatially into a single canvas and learns their compositional relationships through spatial attention, producing semantically coherent and hierarchically structured layers. For image-to-PSD decomposition, it performs iterative in-context editing, progressively extracting and erasing textual and foreground components to reconstruct editable PSD layers from a single flattened image. An RGBA-VAE is employed as an auxiliary representation module to preserve transparency without affecting structure learning. Extensive experiments on our new RGBA-layered dataset demonstrate that OmniPSD achieves high-fidelity generation, structural consistency, and transparency awareness, offering a new paradigm for layered design generation and decomposition with diffusion transformers.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Composing Concepts from Images and Videos via Concept-prompt Binding</title>
      <itunes:episode>1464</itunes:episode>
      <podcast:episode>1464</podcast:episode>
      <itunes:title>Composing Concepts from Images and Videos via Concept-prompt Binding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c33dc8e-503d-4b97-9117-6fdb7e3802a7</guid>
      <link>https://share.transistor.fm/s/f93d6a6a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Composing Concepts from Images and Videos via Concept-prompt Binding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09824v1">http://arxiv.org/abs/2512.09824v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind &amp; Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Composing Concepts from Images and Videos via Concept-prompt Binding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09824v1">http://arxiv.org/abs/2512.09824v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind &amp; Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Dec 2025 19:11:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f93d6a6a/70e2a28c.mp3" length="22200349" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao</p>

            <p><strong>Title:</strong><br>
            Composing Concepts from Images and Videos via Concept-prompt Binding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.09824v1">http://arxiv.org/abs/2512.09824v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind &amp; Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance</title>
      <itunes:episode>1463</itunes:episode>
      <podcast:episode>1463</podcast:episode>
      <itunes:title>Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c5a9582-4807-4122-92a4-d69d5bf24b7e</guid>
      <link>https://share.transistor.fm/s/83e8319a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08765v1">http://arxiv.org/abs/2512.08765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08765v1">http://arxiv.org/abs/2512.08765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Dec 2025 19:15:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83e8319a/2715e16b.mp3" length="22589896" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1408</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08765v1">http://arxiv.org/abs/2512.08765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame's features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro's commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move's superior motion quality. Code, models, and benchmark data are made publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform</title>
      <itunes:episode>1462</itunes:episode>
      <podcast:episode>1462</podcast:episode>
      <itunes:title>Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b7d2cd6f-fb32-414c-95be-89468394a004</guid>
      <link>https://share.transistor.fm/s/87eedd79</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li, Yuanjun Liao, Jiaming Chen, Yuanyuan Gao, Jiaqi Chen, Minming Chen, Li Zhou, Yuning Zhang, Wei Wang, Xiaoqing Hou, Huaxi Huang, Shixiang Tang, Le Ma, Dingwen Zhang, Xue Yang, Junchi Yan, Yanchi Zhang, Yinqiang Zheng, Xiao Sun, Zhihang Zhong</p>

            <p><strong>Title:</strong><br>
            Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08478v1">http://arxiv.org/abs/2512.08478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting representations and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, with identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li, Yuanjun Liao, Jiaming Chen, Yuanyuan Gao, Jiaqi Chen, Minming Chen, Li Zhou, Yuning Zhang, Wei Wang, Xiaoqing Hou, Huaxi Huang, Shixiang Tang, Le Ma, Dingwen Zhang, Xue Yang, Junchi Yan, Yanchi Zhang, Yinqiang Zheng, Xiao Sun, Zhihang Zhong</p>

            <p><strong>Title:</strong><br>
            Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08478v1">http://arxiv.org/abs/2512.08478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting representations and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, with identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Dec 2025 19:14:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/87eedd79/cd32a96e.mp3" length="24665908" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1538</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li, Yuanjun Liao, Jiaming Chen, Yuanyuan Gao, Jiaqi Chen, Minming Chen, Li Zhou, Yuning Zhang, Wei Wang, Xiaoqing Hou, Huaxi Huang, Shixiang Tang, Le Ma, Dingwen Zhang, Xue Yang, Junchi Yan, Yanchi Zhang, Yinqiang Zheng, Xiao Sun, Zhihang Zhong</p>

            <p><strong>Title:</strong><br>
            Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.08478v1">http://arxiv.org/abs/2512.08478v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting representations and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, "click-to-run" browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, with identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality</title>
      <itunes:episode>1461</itunes:episode>
      <podcast:episode>1461</podcast:episode>
      <itunes:title>Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3336915-579b-499e-b91f-fb233386bd33</guid>
      <link>https://share.transistor.fm/s/ad4f1889</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07951v1">http://arxiv.org/abs/2512.07951v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07951v1">http://arxiv.org/abs/2512.07951v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Dec 2025 19:14:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ad4f1889/48e452b3.mp3" length="22196183" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07951v1">http://arxiv.org/abs/2512.07951v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory</title>
      <itunes:episode>1460</itunes:episode>
      <podcast:episode>1460</podcast:episode>
      <itunes:title>OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0c74eb14-8c46-410b-9c26-3531d80f0977</guid>
      <link>https://share.transistor.fm/s/68cb14d3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie</p>

            <p><strong>Title:</strong><br>
            OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07802v1">http://arxiv.org/abs/2512.07802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie</p>

            <p><strong>Title:</strong><br>
            OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07802v1">http://arxiv.org/abs/2512.07802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Dec 2025 19:14:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/68cb14d3/e3c04e80.mp3" length="21765253" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1357</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie</p>

            <p><strong>Title:</strong><br>
            OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07802v1">http://arxiv.org/abs/2512.07802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning</title>
      <itunes:episode>1459</itunes:episode>
      <podcast:episode>1459</podcast:episode>
      <itunes:title>Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">69c1ef86-051f-477f-bb80-a71345066015</guid>
      <link>https://share.transistor.fm/s/2a4cb073</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07461v1">http://arxiv.org/abs/2512.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07461v1">http://arxiv.org/abs/2512.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:30:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2a4cb073/4162cac1.mp3" length="23180487" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07461v1">http://arxiv.org/abs/2512.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs</title>
      <itunes:episode>1458</itunes:episode>
      <podcast:episode>1458</podcast:episode>
      <itunes:title>Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8154fcc5-a16d-4707-bdb0-977e9f40511c</guid>
      <link>https://share.transistor.fm/s/76a2aa27</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07525v1">http://arxiv.org/abs/2512.07525v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07525v1">http://arxiv.org/abs/2512.07525v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:30:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/76a2aa27/abeaeecb.mp3" length="21002912" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1309</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07525v1">http://arxiv.org/abs/2512.07525v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the real component of the complex-valued dot product for attention score calculation. This simplification discards the imaginary component, which contains valuable phase information, leading to a potential loss of relational details crucial for modeling long-context dependencies. In this paper, we propose an extension that re-incorporates this discarded imaginary component. Our method leverages the full complex-valued representation to create a dual-component attention score. We theoretically and empirically demonstrate that this approach enhances the modeling of long-context dependencies by preserving more positional information. Furthermore, evaluations on a suite of long-context language modeling benchmarks show that our method consistently improves performance over the standard RoPE, with the benefits becoming more significant as context length increases. The code is available at https://github.com/OpenMOSS/rope_pp.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unified Video Editing with Temporal Reasoner</title>
      <itunes:episode>1457</itunes:episode>
      <podcast:episode>1457</podcast:episode>
      <itunes:title>Unified Video Editing with Temporal Reasoner</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8b07e93-2191-406a-a9ef-5f679d1e2216</guid>
      <link>https://share.transistor.fm/s/0ae30bcf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu</p>

            <p><strong>Title:</strong><br>
            Unified Video Editing with Temporal Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07469v1">http://arxiv.org/abs/2512.07469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu</p>

            <p><strong>Title:</strong><br>
            Unified Video Editing with Temporal Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07469v1">http://arxiv.org/abs/2512.07469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:29:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0ae30bcf/17868b62.mp3" length="19803329" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangpeng Yang, Ji Xie, Yiyuan Yang, Yan Huang, Min Xu, Qiang Wu</p>

            <p><strong>Title:</strong><br>
            Unified Video Editing with Temporal Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07469v1">http://arxiv.org/abs/2512.07469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. VideoCoF enforces a "see, reason, then edit" procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves state-of-the-art performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach. Our code, weights, and data are available at https://github.com/knightyxp/VideoCoF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Voxify3D: Pixel Art Meets Volumetric Rendering</title>
      <itunes:episode>1456</itunes:episode>
      <podcast:episode>1456</podcast:episode>
      <itunes:title>Voxify3D: Pixel Art Meets Volumetric Rendering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5efc7339-b67a-4f6d-8fae-68f631888961</guid>
      <link>https://share.transistor.fm/s/c178fdfc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Voxify3D: Pixel Art Meets Volumetric Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07834v1">http://arxiv.org/abs/2512.07834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Voxify3D: Pixel Art Meets Volumetric Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07834v1">http://arxiv.org/abs/2512.07834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:29:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c178fdfc/e3646415.mp3" length="21447583" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Voxify3D: Pixel Art Meets Volumetric Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.07834v1">http://arxiv.org/abs/2512.07834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Zero-Shot Reference-to-Video Generation</title>
      <itunes:episode>1455</itunes:episode>
      <podcast:episode>1455</podcast:episode>
      <itunes:title>Scaling Zero-Shot Reference-to-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c1decd3a-eb76-4569-b836-67cbc324ad97</guid>
      <link>https://share.transistor.fm/s/78c9c5a0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He</p>

            <p><strong>Title:</strong><br>
            Scaling Zero-Shot Reference-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06905v1">http://arxiv.org/abs/2512.06905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He</p>

            <p><strong>Title:</strong><br>
            Scaling Zero-Shot Reference-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06905v1">http://arxiv.org/abs/2512.06905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:29:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78c9c5a0/9e075b0d.mp3" length="22966865" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1432</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong, Hang Li, Chuyan Zhu, Aditya Patel, Tao Xiang, Sen He</p>

            <p><strong>Title:</strong><br>
            Scaling Zero-Shot Reference-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06905v1">http://arxiv.org/abs/2512.06905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems</title>
      <itunes:episode>1454</itunes:episode>
      <podcast:episode>1454</podcast:episode>
      <itunes:title>DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d6884adc-5766-4f87-aaa3-067196a4387e</guid>
      <link>https://share.transistor.fm/s/a269113d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Tianming Yang, Saravan Rajmohan, Dongmei Zhang</p>

            <p><strong>Title:</strong><br>
            DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06749v2">http://arxiv.org/abs/2512.06749v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Tianming Yang, Saravan Rajmohan, Dongmei Zhang</p>

            <p><strong>Title:</strong><br>
            DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06749v2">http://arxiv.org/abs/2512.06749v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Dec 2025 19:28:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a269113d/2121d637.mp3" length="23948255" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1493</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Tianming Yang, Saravan Rajmohan, Dongmei Zhang</p>

            <p><strong>Title:</strong><br>
            DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.06749v2">http://arxiv.org/abs/2512.06749v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows</title>
      <itunes:episode>1453</itunes:episode>
      <podcast:episode>1453</podcast:episode>
      <itunes:title>TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">80c51514-5cdb-45d7-9b41-03c84708e151</guid>
      <link>https://share.transistor.fm/s/38607cd3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin</p>

            <p><strong>Title:</strong><br>
            TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05150v1">http://arxiv.org/abs/2512.05150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (&lt; 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100x with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin</p>

            <p><strong>Title:</strong><br>
            TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05150v1">http://arxiv.org/abs/2512.05150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (&lt; 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100x with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Dec 2025 19:15:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38607cd3/ac4789f8.mp3" length="22021895" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1373</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin</p>

            <p><strong>Title:</strong><br>
            TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05150v1">http://arxiv.org/abs/2512.05150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations (NFEs)). While various few-step methods aim to accelerate inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (&lt; 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B and transform it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100× with minor quality degradation. Project page is available at https://zhenglin-cheng.com/twinflow.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EditThinker: Unlocking Iterative Reasoning for Any Image Editor</title>
      <itunes:episode>1452</itunes:episode>
      <podcast:episode>1452</podcast:episode>
      <itunes:title>EditThinker: Unlocking Iterative Reasoning for Any Image Editor</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5aa0b976-4e72-49c9-b050-505ac767ed81</guid>
      <link>https://share.transistor.fm/s/f14fbb2d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu</p>

            <p><strong>Title:</strong><br>
            EditThinker: Unlocking Iterative Reasoning for Any Image Editor</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05965v1">http://arxiv.org/abs/2512.05965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based image editing has emerged as a prominent research area that, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that allows image editors to 'think' while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu</p>

            <p><strong>Title:</strong><br>
            EditThinker: Unlocking Iterative Reasoning for Any Image Editor</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05965v1">http://arxiv.org/abs/2512.05965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based image editing has emerged as a prominent research area that, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that allows image editors to 'think' while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Dec 2025 19:15:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f14fbb2d/f0526cc0.mp3" length="25513924" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1591</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu</p>

            <p><strong>Title:</strong><br>
            EditThinker: Unlocking Iterative Reasoning for Any Image Editor</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05965v1">http://arxiv.org/abs/2512.05965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based image editing has emerged as a prominent research area that, benefiting from image generation foundation models, has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that allows image editors to 'think' while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align EditThinker's thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks</title>
      <itunes:episode>1451</itunes:episode>
      <podcast:episode>1451</podcast:episode>
      <itunes:title>From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a62bffab-eb0a-4e51-bad3-467765d987c1</guid>
      <link>https://share.transistor.fm/s/5a499e5a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang</p>

            <p><strong>Title:</strong><br>
            From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02580v1">http://arxiv.org/abs/2512.02580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose <strong>CAPO</strong> (<strong>C</strong>urriculum <strong>A</strong>dvantage <strong>P</strong>olicy <strong>O</strong>ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang</p>

            <p><strong>Title:</strong><br>
            From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02580v1">http://arxiv.org/abs/2512.02580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose <strong>CAPO</strong> (<strong>C</strong>urriculum <strong>A</strong>dvantage <strong>P</strong>olicy <strong>O</strong>ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Dec 2025 19:14:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5a499e5a/96ec5916.mp3" length="22770083" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1419</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang</p>

            <p><strong>Title:</strong><br>
            From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02580v1">http://arxiv.org/abs/2512.02580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose <strong>CAPO</strong> (<strong>C</strong>urriculum <strong>A</strong>dvantage <strong>P</strong>olicy <strong>O</strong>ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture</title>
      <itunes:episode>1450</itunes:episode>
      <podcast:episode>1450</podcast:episode>
      <itunes:title>EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a990274f-2f22-4204-82d8-e0a0e402b114</guid>
      <link>https://share.transistor.fm/s/979063ab</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian</p>

            <p><strong>Title:</strong><br>
            EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04810v2">http://arxiv.org/abs/2512.04810v2</a></p>

            <p><strong>Abstract:</strong><br>
            We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian</p>

            <p><strong>Title:</strong><br>
            EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04810v2">http://arxiv.org/abs/2512.04810v2</a></p>

            <p><strong>Abstract:</strong><br>
            We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Dec 2025 19:14:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/979063ab/af38713a.mp3" length="23623525" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1473</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian</p>

            <p><strong>Title:</strong><br>
            EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04810v2">http://arxiv.org/abs/2512.04810v2</a></p>

            <p><strong>Abstract:</strong><br>
            We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle</title>
      <itunes:episode>1449</itunes:episode>
      <podcast:episode>1449</podcast:episode>
      <itunes:title>DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3d2d34b6-48bd-4b08-abf4-a30842c908e6</guid>
      <link>https://share.transistor.fm/s/83f27c05</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04324v1">http://arxiv.org/abs/2512.04324v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04324v1">http://arxiv.org/abs/2512.04324v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:33:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83f27c05/cc12f348.mp3" length="26376187" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1645</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyu Lei, Jinxiang Meng, Yiming Huang, Junjie Zhao, Yitong Zhang, Jianwen Luo, Xin Zou, Ruiyi Yang, Wenbo Shi, Yan Gao, Shizhu He, Zuo Wang, Qian Liu, Yang Wang, Ke Wang, Jun Zhao, Kang Liu</p>

            <p><strong>Title:</strong><br>
            DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04324v1">http://arxiv.org/abs/2512.04324v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world enterprise data intelligence workflows encompass data engineering that turns raw sources into analysis-ready tables and data analysis that converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under evolving requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM-judge, which is guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</title>
      <itunes:episode>1448</itunes:episode>
      <podcast:episode>1448</podcast:episode>
      <itunes:title>Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">09668d9e-16f8-4e31-a19d-29bb02fe4309</guid>
      <link>https://share.transistor.fm/s/79ff70c2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 113 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04677v1">http://arxiv.org/abs/2512.04677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 113 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04677v1">http://arxiv.org/abs/2512.04677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:33:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/79ff70c2/79f07aad.mp3" length="24295594" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1515</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 113 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi</p>

            <p><strong>Title:</strong><br>
            Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04677v1">http://arxiv.org/abs/2512.04677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction</title>
      <itunes:episode>1447</itunes:episode>
      <podcast:episode>1447</podcast:episode>
      <itunes:title>Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4c8aa6e-0eba-4d76-8b66-1a94f55ae77d</guid>
      <link>https://share.transistor.fm/s/db5f799f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nex-AGI Team: Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun Shi, Wentao Shu, Peng Sun, Yiran Suo, Tian Tang, Boyu Tian, Guoteng Wang, Junzhe Wang, Peixin Wang, Zhiheng Xi, Hang Yan, Jie Yang, Zhixiong Yang, Tianchu Yao, Guangze Ye, Qianxi Yu, Shuo Zhang, Xinyue Zhang, Yiqi Zhang, Jiarong Zhao, Miao Zheng, Rui Zheng, Enyu Zhou, Jiazheng Zhou, Maosen Zhou, Yuhao Zhou, Tao Gui, Yining Zheng, Xinchi Chen, Jie Zhou, Siyuan Feng, Qin Chen, Liang He, Qi Zhang, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04987v1">http://arxiv.org/abs/2512.04987v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nex-AGI Team: Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun Shi, Wentao Shu, Peng Sun, Yiran Suo, Tian Tang, Boyu Tian, Guoteng Wang, Junzhe Wang, Peixin Wang, Zhiheng Xi, Hang Yan, Jie Yang, Zhixiong Yang, Tianchu Yao, Guangze Ye, Qianxi Yu, Shuo Zhang, Xinyue Zhang, Yiqi Zhang, Jiarong Zhao, Miao Zheng, Rui Zheng, Enyu Zhou, Jiazheng Zhou, Maosen Zhou, Yuhao Zhou, Tao Gui, Yining Zheng, Xinchi Chen, Jie Zhou, Siyuan Feng, Qin Chen, Liang He, Qi Zhang, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04987v1">http://arxiv.org/abs/2512.04987v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:33:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/db5f799f/ba64ef75.mp3" length="23733032" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1480</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nex-AGI Team: Yuxuan Cai, Lu Chen, Qiaoling Chen, Yuyang Ding, Liwen Fan, Wenjie Fu, Yufei Gao, Honglin Guo, Pinxue Guo, Zhenhua Han, Zhengfu He, Hanglei Hu, Kai Hu, Shengjia Hua, Tianyu Huai, Baodai Huang, Li Ji, Zhen Jiang, Zhikai Lei, Bufan Li, Jiahang Lin, Lizhi Lin, Jinxiu Liu, Shichun Liu, Ziming Liu, Yuchen Ni, Pengfang Qian, Yujiong Shen, Qingyun Shi, Wentao Shu, Peng Sun, Yiran Suo, Tian Tang, Boyu Tian, Guoteng Wang, Junzhe Wang, Peixin Wang, Zhiheng Xi, Hang Yan, Jie Yang, Zhixiong Yang, Tianchu Yao, Guangze Ye, Qianxi Yu, Shuo Zhang, Xinyue Zhang, Yiqi Zhang, Jiarong Zhao, Miao Zheng, Rui Zheng, Enyu Zhou, Jiazheng Zhou, Maosen Zhou, Yuhao Zhou, Tao Gui, Yining Zheng, Xinchi Chen, Jie Zhou, Siyuan Feng, Qin Chen, Liang He, Qi Zhang, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04987v1">http://arxiv.org/abs/2512.04987v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of Large Language Models (LLMs) from passive responders to autonomous agents necessitates a fundamental shift in learning paradigms -- from static imitation to incentive-driven decision making. However, this transition is significantly impeded by the lack of scalable infrastructure capable of constructing high-quality interaction signals for effective policy learning. To address this, we introduce a comprehensive method designed to systematically scale the diversity and complexity of interactive environments. Our method realizes this scaling by addressing three orthogonal dimensions: (1) Complexity: NexAU, a flexible agent framework that supports building complex agent hierarchies via simple configurations; (2) Diversity: NexA4A automatically generates diverse agent hierarchies from natural language to cover infinite domains; and (3) Fidelity: NexGAP bridges the simulation-reality gap by integrating dynamic real-world environments for grounded trajectory synthesis. We train Nex-N1 upon the diverse and complex interactive environments established by our infrastructure. Empirical results on benchmarks such as SWE-bench and tau2 demonstrate that Nex-N1 consistently outperforms SOTA open-source models and achieves competitive performance against frontier proprietary models on complex agentic tasks. We open-source the Nex ecosystem and model weights to facilitate further research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning</title>
      <itunes:episode>1446</itunes:episode>
      <podcast:episode>1446</podcast:episode>
      <itunes:title>ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81bf7b88-f7cc-45e6-bc8b-93435cffbbd8</guid>
      <link>https://share.transistor.fm/s/bdcc55a2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05111v1">http://arxiv.org/abs/2512.05111v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05111v1">http://arxiv.org/abs/2512.05111v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:32:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bdcc55a2/85c841cb.mp3" length="22774660" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.05111v1">http://arxiv.org/abs/2512.05111v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation</title>
      <itunes:episode>1445</itunes:episode>
      <podcast:episode>1445</podcast:episode>
      <itunes:title>Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d05a3613-3015-4dfa-adf7-ddb164184df8</guid>
      <link>https://share.transistor.fm/s/2e152e21</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04678v1">http://arxiv.org/abs/2512.04678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04678v1">http://arxiv.org/abs/2512.04678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:32:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e152e21/5a44dce1.mp3" length="20368467" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1269</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04678v1">http://arxiv.org/abs/2512.04678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion</title>
      <itunes:episode>1444</itunes:episode>
      <podcast:episode>1444</podcast:episode>
      <itunes:title>Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba99ebb0-f523-4ab4-bfa8-ee182f70ff86</guid>
      <link>https://share.transistor.fm/s/4571a6a0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng</p>

            <p><strong>Title:</strong><br>
            Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04926v1">http://arxiv.org/abs/2512.04926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.</p>
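
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of the asynchronous-schedule idea: the semantic latent is denoised ahead of the texture latent by a fixed timestep offset, so semantics form first and anchor texture refinement. The step count, offset, and denoiser interface below are placeholders, not the paper's configuration.</p>

            <pre><code>
def async_timesteps(num_steps=10, total_t=1000, offset=200):
    """Yield (t_semantic, t_texture) pairs with semantics leading by `offset`."""
    for i in range(num_steps):
        t_tex = int(total_t * (num_steps - 1 - i) / (num_steps - 1))  # texture timestep
        t_sem = max(t_tex - offset, 0)                                # semantics run ahead
        yield t_sem, t_tex

def denoise(z_sem, z_tex, model):
    # `model` is a hypothetical denoiser taking both latents and their distinct timesteps.
    for t_sem, t_tex in async_timesteps():
        z_sem, z_tex = model(z_sem, z_tex, t_sem, t_tex)
    return z_sem, z_tex

# At the first step the pair is (800, 1000): semantics are already partly denoised while
# textures are still pure noise; by the last step both timesteps reach 0.
</code></pre>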
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng</p>

            <p><strong>Title:</strong><br>
            Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04926v1">http://arxiv.org/abs/2512.04926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:31:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4571a6a0/d770efe7.mp3" length="24703539" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1540</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng</p>

            <p><strong>Title:</strong><br>
            Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.04926v1">http://arxiv.org/abs/2512.04926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing</title>
      <itunes:episode>1443</itunes:episode>
      <podcast:episode>1443</podcast:episode>
      <itunes:title>PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2f6d3616-b877-448a-b165-b694f190986e</guid>
      <link>https://share.transistor.fm/s/04315df2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He</p>

            <p><strong>Title:</strong><br>
            PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02589v1">http://arxiv.org/abs/2512.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo, along with a video, can be found at https://github.com/PaperDebugger/PaperDebugger.</p>
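
            <p><strong>Illustrative sketch:</strong><br>
            As one small illustration of a diff-based update (the abstract does not specify the exact mechanism), the sketch below uses Python's standard difflib to compute a unified diff between the current LaTeX source and a proposed revision; the file names and snippet are hypothetical, not PaperDebugger's actual pipeline.</p>

            <pre><code>
import difflib

# Hypothetical before/after LaTeX snippets; in practice these would come from the editor
# buffer and the assistant's proposed revision.
original = [
    "\\section{Introduction}\n",
    "Large language models is increasingly used in writing.\n",
]
revised = [
    "\\section{Introduction}\n",
    "Large language models are increasingly used in writing.\n",
]

patch = difflib.unified_diff(original, revised,
                             fromfile="main.tex", tofile="main.tex (revised)")
print("".join(patch))
</code></pre>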
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He</p>

            <p><strong>Title:</strong><br>
            PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02589v1">http://arxiv.org/abs/2512.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo, along with a video, can be found at https://github.com/PaperDebugger/PaperDebugger.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Dec 2025 19:31:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/04315df2/3ab1799c.mp3" length="22530151" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He</p>

            <p><strong>Title:</strong><br>
            PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02589v1">http://arxiv.org/abs/2512.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo, along with a video, can be found at https://github.com/PaperDebugger/PaperDebugger.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3-VL Technical Report</title>
      <itunes:episode>1442</itunes:episode>
      <podcast:episode>1442</podcast:episode>
      <itunes:title>Qwen3-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">722519e6-5596-4366-abed-2915f62e532d</guid>
      <link>https://share.transistor.fm/s/131e9d13</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21631v2">http://arxiv.org/abs/2511.21631v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.</p>
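
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of what explicit textual timestamp alignment for video could look like: each sampled frame placeholder is preceded by a textual timestamp so events can be grounded in time. The tag format and sampling rate are illustrative assumptions, not the report's actual tokenization.</p>

            <pre><code>
def interleave_timestamps(num_frames, fps=2.0):
    """Interleave textual timestamps with frame placeholders (illustrative format)."""
    sequence = []
    for i in range(num_frames):
        seconds = i / fps
        sequence.append(f"[time {seconds:.1f}s]")   # textual timestamp preceding the frame
        sequence.append(f"[frame_{i}]")             # placeholder for the frame's visual tokens
    return sequence

print(interleave_timestamps(3))
# ['[time 0.0s]', '[frame_0]', '[time 0.5s]', '[frame_1]', '[time 1.0s]', '[frame_2]']
</code></pre>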
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21631v2">http://arxiv.org/abs/2511.21631v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Dec 2025 19:12:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/131e9d13/1e2be85c.mp3" length="26060159" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1625</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu</p>

            <p><strong>Title:</strong><br>
            Qwen3-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21631v2">http://arxiv.org/abs/2511.21631v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach</title>
      <itunes:episode>1441</itunes:episode>
      <podcast:episode>1441</podcast:episode>
      <itunes:title>Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">db12268d-c585-4dde-ab43-3b4bff2c351c</guid>
      <link>https://share.transistor.fm/s/eaacfb36</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02834v1">http://arxiv.org/abs/2512.02834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes arise that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose <strong>TACO</strong>, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. VLA models integrated with TACO execute the action chunk with the maximum pseudo-count among all sampled action chunks, thereby preventing distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptations.</p>
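
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of the test-time selection idea: sample several action chunks from the policy, score each with a pseudo-count estimator of how in-distribution it is, and execute the highest-scoring chunk. The kernel-density pseudo-count below is an illustrative stand-in, not the paper's estimator.</p>

            <pre><code>
import numpy as np

def pseudo_count(chunk, dataset_chunks, bandwidth=1.0):
    # Toy stand-in for a pseudo-count estimator: average Gaussian kernel similarity to
    # action chunks from the successful demonstration data.
    dists = np.linalg.norm(dataset_chunks - chunk, axis=1)
    return np.exp(-(dists ** 2) / (2 * bandwidth ** 2)).mean()

def select_action_chunk(sample_policy, dataset_chunks, num_samples=8):
    # Test-time scaling: sample several chunks and execute the most in-distribution one.
    candidates = [sample_policy() for _ in range(num_samples)]
    scores = [pseudo_count(c, dataset_chunks) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage with random vectors standing in for flattened action chunks.
rng = np.random.default_rng(0)
dataset = rng.normal(size=(100, 14))    # chunks from successful demonstrations
policy = lambda: rng.normal(size=14)    # stand-in for sampling the VLA policy
best = select_action_chunk(policy, dataset)
</code></pre>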
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02834v1">http://arxiv.org/abs/2512.02834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes arise that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose <strong>TACO</strong>, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. VLA models integrated with TACO execute the action chunk with the maximum pseudo-count among all sampled action chunks, thereby preventing distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptations.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Dec 2025 19:11:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eaacfb36/dd3dc523.mp3" length="21654932" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02834v1">http://arxiv.org/abs/2512.02834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes arise that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose <strong>TACO</strong>, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. VLA models integrated with TACO execute the action chunk with the maximum pseudo-count among all sampled action chunks, thereby preventing distribution shift while preserving the generalization ability of VLAs, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PretrainZero: Reinforcement Active Pretraining</title>
      <itunes:episode>1440</itunes:episode>
      <podcast:episode>1440</podcast:episode>
      <itunes:title>PretrainZero: Reinforcement Active Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ec191afd-b9f1-4878-a36e-bd4388d2e8c3</guid>
      <link>https://share.transistor.fm/s/ce66d576</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang</p>

            <p><strong>Title:</strong><br>
            PretrainZero: Reinforcement Active Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03442v1">http://arxiv.org/abs/2512.03442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mimicking the human ability to actively learn from general experience and thereby achieve artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus and to reason to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on the MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.</p>
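
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of the active-pretraining loop described above: mask an informative span in a passage, let the policy predict it, and use agreement with the original span as a verifiable reward. The span selection and reward function are illustrative stand-ins, not the paper's actual policy or scoring.</p>

            <pre><code>
import re

def mask_span(passage, span):
    # Replace the first occurrence of the chosen span with a mask token.
    return passage.replace(span, "[MASK]", 1)

def reward(prediction, target):
    # Toy verifiable reward: 1.0 for a normalized exact match, else 0.0.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(target) else 0.0

def rl_step(policy, passage, span):
    # One self-supervised RL step: the policy reasons about the masked passage and
    # predicts the span; the match reward is the training signal.
    prompt = mask_span(passage, span)
    prediction = policy(prompt)
    return reward(prediction, span)

passage = "The Eiffel Tower was completed in 1889 for the Paris World's Fair."
print(rl_step(lambda prompt: "1889", passage, "1889"))   # 1.0
</code></pre>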
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang</p>

            <p><strong>Title:</strong><br>
            PretrainZero: Reinforcement Active Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03442v1">http://arxiv.org/abs/2512.03442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mimicking the human ability to actively learn from general experience and thereby achieve artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus and to reason to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on the MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Dec 2025 19:11:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ce66d576/df351c1f.mp3" length="21952477" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang</p>

            <p><strong>Title:</strong><br>
            PretrainZero: Reinforcement Active Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03442v1">http://arxiv.org/abs/2512.03442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mimicking the human ability to actively learn from general experience and thereby achieve artificial general intelligence has long been a dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus to extend RL from domain-specific post-training to general pretraining. PretrainZero features the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus and to reason to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data-wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on the MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ViDiC: Video Difference Captioning</title>
      <itunes:episode>1439</itunes:episode>
      <podcast:episode>1439</podcast:episode>
      <itunes:title>ViDiC: Video Difference Captioning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">93697458-e0b7-4d15-82f7-b5ff2fa03bae</guid>
      <link>https://share.transistor.fm/s/8302abbb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiangtao Wu, Shihao Li, Zhaozhou Bian, Yuanxing Zhang, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            ViDiC: Video Difference Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03405v1">http://arxiv.org/abs/2512.03405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.</p>
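
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of the dual-checklist scoring idea: given per-item judge verdicts, similarity and difference accuracy are computed separately. The item structure and field names are assumptions for illustration, not the benchmark's actual schema.</p>

            <pre><code>
def dual_checklist_scores(items):
    """items: list of dicts like {"type": "similarity" or "difference", "correct": bool}."""
    def accuracy(kind):
        subset = [it["correct"] for it in items if it["type"] == kind]
        return sum(subset) / len(subset) if subset else 0.0
    return {"similarity_acc": accuracy("similarity"),
            "difference_acc": accuracy("difference")}

# Toy judge verdicts for one video pair.
verdicts = [
    {"type": "similarity", "correct": True},
    {"type": "similarity", "correct": False},
    {"type": "difference", "correct": True},
    {"type": "difference", "correct": True},
]
print(dual_checklist_scores(verdicts))   # {'similarity_acc': 0.5, 'difference_acc': 1.0}
</code></pre>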
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiangtao Wu, Shihao Li, Zhaozhou Bian, Yuanxing Zhang, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            ViDiC: Video Difference Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03405v1">http://arxiv.org/abs/2512.03405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Dec 2025 19:11:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8302abbb/aae250c5.mp3" length="22660906" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1413</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiangtao Wu, Shihao Li, Zhaozhou Bian, Yuanxing Zhang, Jialu Chen, Runzhe Wen, An Ping, Yiwen He, Jiakai Wang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            ViDiC: Video Difference Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03405v1">http://arxiv.org/abs/2512.03405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding visual differences between dynamic scenes requires the comparative perception of compositional, spatial, and temporal changes--a capability that remains underexplored in existing vision-language systems. While prior work on Image Difference Captioning (IDC) has enabled models to describe semantic changes between static images, these approaches fail to capture motion continuity, event evolution, or editing consistency over time. We introduce the ViDiC (Video Difference Captioning) task and its corresponding ViDiC-1K dataset, designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to provide fine-grained descriptions of similarities and differences between video pairs. ViDiC-1K comprises 1,000 curated video pairs annotated with over 4,000 comparative checklist items, covering seven categories: subject, style, background, cinematography, motion, location, and playback techniques. To ensure reliable evaluation, we propose a dual-checklist framework that measures the accuracy of similarity and difference separately, based on the LLM-as-a-Judge protocol. Experiments on nineteen representative multimodal models reveal a significant performance gap in their comparative description and difference perception abilities. We hope ViDiC-1K can be a challenging benchmark that lays a solid foundation for advancing video understanding, edit awareness, and comparative reasoning in multimodal intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models</title>
      <itunes:episode>1438</itunes:episode>
      <podcast:episode>1438</podcast:episode>
      <itunes:title>DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7bcae5e1-3237-40e3-9ab4-c0700f14e195</guid>
      <link>https://share.transistor.fm/s/3eb7f97c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu</p>

            <p><strong>Title:</strong><br>
            DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02556v1">http://arxiv.org/abs/2512.02556v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.</p>
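
            <p><strong>Illustrative sketch:</strong><br>
            A generic top-k sparse attention sketch, included only to illustrate why sparse attention reduces long-context cost: each query attends to its k highest-scoring keys rather than all of them. This is not the actual DeepSeek Sparse Attention design, whose selection mechanism the abstract does not specify.</p>

            <pre><code>
import numpy as np

def topk_sparse_attention(q, k, v, top_k=4):
    # Each query keeps only its top_k highest-scoring keys; the rest are masked out
    # before the softmax, so effective cost scales with top_k rather than the key count.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]      # per-query top_k threshold
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = topk_sparse_attention(q, k, v)    # shape (2, 8)
</code></pre>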
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu</p>

            <p><strong>Title:</strong><br>
            DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02556v1">http://arxiv.org/abs/2512.02556v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:49:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3eb7f97c/8aac7d85.mp3" length="21356546" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1331</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M. S. Di, M. Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S. H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Songyang Zhou, Tao Ni, Tao Yun, Tian Pei, Tian Ye, Tianyuan Yue, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjun Gao, Wentao Zhang, Xi Gao, Xiangwen Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyuan Li, Xu Chen, Xuecheng Su, Xuehai Pan, Xuheng Lin, Xuwei Fu, Y. Q. Wang, Yang Zhang, Yanhong Xu, Yanru Ma, Yao Li, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yiliang Xiong, Ying He, Ying Zhou, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuduan Wang, Yue Gong, Yuhan Wu, Yuheng Zou, Yukun Li, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Z. F. Wu, Z. Z. Ren, Zehua Zhao, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhiyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Zizheng Pan, Zongqing Yao, Bei Feng, Hui Li, J. L. Cai, Jiaqi Ni, Lei Xu, Meng Li, Ning Tian, R. J. Chen, R. L. Jin, S. S. Li, Shuang Zhou, Tianyu Sun, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xinnan Song, Xinyi Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Dongjie Ji, Jian Liang, Jianzhong Guo, Jin Chen, Leyi Xia, Miaojun Wang, Mingming Li, Peng Zhang, Ruyi Chen, Shangmian Sun, Shaoqing Wu, Shengfeng Ye, T. Wang, W. L. Xiao, Wei An, Xianzu Wang, Xiaowen Sun, Xiaoxiang Wang, Ying Tang, Yukun Zha, Zekai Zhang, Zhe Ju, Zhen Zhang, Zihua Qu</p>

            <p><strong>Title:</strong><br>
            DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02556v1">http://arxiv.org/abs/2512.02556v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration</title>
      <itunes:episode>1437</itunes:episode>
      <podcast:episode>1437</podcast:episode>
      <itunes:title>ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e16dd24f-798c-4ad1-bf53-af48f8acb290</guid>
      <link>https://share.transistor.fm/s/2b8e40f6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21689v1">http://arxiv.org/abs/2511.21689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21689v1">http://arxiv.org/abs/2511.21689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:48:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2b8e40f6/e556f1d2.mp3" length="20593368" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.21689v1">http://arxiv.org/abs/2511.21689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models are powerful generalists, yet solving deep and complex problems such as those of Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MultiShotMaster: A Controllable Multi-Shot Video Generation Framework</title>
      <itunes:episode>1436</itunes:episode>
      <podcast:episode>1436</podcast:episode>
      <itunes:title>MultiShotMaster: A Controllable Multi-Shot Video Generation Framework</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f241a2e5-383c-4237-8cb2-96d5cebdfc47</guid>
      <link>https://share.transistor.fm/s/e283ac24</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia</p>

            <p><strong>Title:</strong><br>
            MultiShotMaster: A Controllable Multi-Shot Video Generation Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03041v1">http://arxiv.org/abs/2512.03041v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia</p>

            <p><strong>Title:</strong><br>
            MultiShotMaster: A Controllable Multi-Shot Video Generation Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03041v1">http://arxiv.org/abs/2512.03041v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:48:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e283ac24/d45d2b4a.mp3" length="26282617" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1639</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, Xu Jia</p>

            <p><strong>Title:</strong><br>
            MultiShotMaster: A Controllable Multi-Shot Video Generation Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.03041v1">http://arxiv.org/abs/2512.03041v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory</title>
      <itunes:episode>1435</itunes:episode>
      <podcast:episode>1435</podcast:episode>
      <itunes:title>MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e709844d-a6ce-42ed-9716-2634c8e45e41</guid>
      <link>https://share.transistor.fm/s/f4712a93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22609v1">http://arxiv.org/abs/2511.22609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22609v1">http://arxiv.org/abs/2511.22609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:48:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f4712a93/327051c1.mp3" length="22807698" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1422</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22609v1">http://arxiv.org/abs/2511.22609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch</title>
      <itunes:episode>1434</itunes:episode>
      <podcast:episode>1434</podcast:episode>
      <itunes:title>Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7730c568-c911-4403-9b67-fd0492ff2e60</guid>
      <link>https://share.transistor.fm/s/ee29b6f0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02395v1">http://arxiv.org/abs/2512.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02395v1">http://arxiv.org/abs/2512.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:47:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ee29b6f0/345f607b.mp3" length="24668500" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1538</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02395v1">http://arxiv.org/abs/2512.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation</title>
      <itunes:episode>1433</itunes:episode>
      <podcast:episode>1433</podcast:episode>
      <itunes:title>DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ac670ba-ebd6-41ca-9498-915fef191791</guid>
      <link>https://share.transistor.fm/s/820d9918</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23127v2">http://arxiv.org/abs/2511.23127v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23127v2">http://arxiv.org/abs/2511.23127v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:47:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/820d9918/83549be0.mp3" length="19747433" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1231</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23127v2">http://arxiv.org/abs/2511.23127v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40% reduction in camera motion errors compared with prior methods. Our project page: https://soyouthinkyoucantell.github.io/dualcamctrl-page/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Guided Self-Evolving LLMs with Minimal Human Supervision</title>
      <itunes:episode>1432</itunes:episode>
      <podcast:episode>1432</podcast:episode>
      <itunes:title>Guided Self-Evolving LLMs with Minimal Human Supervision</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e44ae795-c064-43ed-ba83-aef549111cc2</guid>
      <link>https://share.transistor.fm/s/edbf07c5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Guided Self-Evolving LLMs with Minimal Human Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02472v1">http://arxiv.org/abs/2512.02472v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Guided Self-Evolving LLMs with Minimal Human Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02472v1">http://arxiv.org/abs/2512.02472v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:47:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/edbf07c5/0a0abbc6.mp3" length="24458631" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1525</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhao Yu, Zhenwen Liang, Chengsong Huang, Kishan Panaganti, Tianqing Fang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Guided Self-Evolving LLMs with Minimal Human Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02472v1">http://arxiv.org/abs/2512.02472v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI self-evolution has long been envisioned as a path toward superintelligence, where models autonomously acquire, refine, and internalize knowledge from their own learning experiences. Yet in practice, unguided self-evolving systems often plateau quickly or even degrade as training progresses. These failures arise from issues such as concept drift, diversity collapse, and mis-evolution, as models reinforce their own biases and converge toward low-entropy behaviors. To enable models to self-evolve in a stable and controllable manner while minimizing reliance on human supervision, we introduce R-Few, a guided Self-Play Challenger-Solver framework that incorporates lightweight human oversight through in-context grounding and mixed training. At each iteration, the Challenger samples a small set of human-labeled examples to guide synthetic question generation, while the Solver jointly trains on human and synthetic examples under an online, difficulty-based curriculum. Across math and general reasoning benchmarks, R-Few achieves consistent and iterative improvements. For example, Qwen3-8B-Base improves by +3.0 points over R-Zero on math tasks and achieves performance on par with General-Reasoner, despite the latter being trained on 20 times more human data. Ablation studies confirm the complementary contributions of grounded challenger training and curriculum-based solver training, and further analysis shows that R-Few mitigates drift, yielding more stable and controllable co-evolutionary dynamics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SimScale: Learning to Drive via Real-World Simulation at Scale</title>
      <itunes:episode>1431</itunes:episode>
      <podcast:episode>1431</podcast:episode>
      <itunes:title>SimScale: Learning to Drive via Real-World Simulation at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c8eba946-b061-453c-832a-eb6a6e635500</guid>
      <link>https://share.transistor.fm/s/9e9bae63</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li</p>

            <p><strong>Title:</strong><br>
            SimScale: Learning to Drive via Real-World Simulation at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23369v1">http://arxiv.org/abs/2511.23369v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li</p>

            <p><strong>Title:</strong><br>
            SimScale: Learning to Drive via Real-World Simulation at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23369v1">http://arxiv.org/abs/2511.23369v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:46:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9e9bae63/633ef8b8.mp3" length="21784534" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li</p>

            <p><strong>Title:</strong><br>
            SimScale: Learning to Drive via Real-World Simulation at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23369v1">http://arxiv.org/abs/2511.23369v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in the real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive numbers of unseen states from existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Using the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly with increasing simulation data alone, even without extra real-world data streaming in. We further reveal several crucial findings about such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties of different policy architectures. Our simulation data and code will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InnoGym: Benchmarking the Innovation Potential of AI Agents</title>
      <itunes:episode>1430</itunes:episode>
      <podcast:episode>1430</podcast:episode>
      <itunes:title>InnoGym: Benchmarking the Innovation Potential of AI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a79afb0d-a1ce-4073-94e0-72a791b7ee3c</guid>
      <link>https://share.transistor.fm/s/8ab54544</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            InnoGym: Benchmarking the Innovation Potential of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01822v1">http://arxiv.org/abs/2512.01822v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            InnoGym: Benchmarking the Innovation Potential of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01822v1">http://arxiv.org/abs/2512.01822v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Dec 2025 19:46:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8ab54544/f21c646a.mp3" length="22727447" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            InnoGym: Benchmarking the Innovation Potential of AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01822v1">http://arxiv.org/abs/2512.01822v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling</title>
      <itunes:episode>1429</itunes:episode>
      <podcast:episode>1429</podcast:episode>
      <itunes:title>LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8fd94b3e-0543-42ff-8071-02f4fcbc0dc3</guid>
      <link>https://share.transistor.fm/s/8aa372cf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 140 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20785v1">http://arxiv.org/abs/2511.20785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 140 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20785v1">http://arxiv.org/abs/2511.20785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:14:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8aa372cf/f15007b2.mp3" length="22135630" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 140 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20785v1">http://arxiv.org/abs/2511.20785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Envision: Benchmarking Unified Understanding &amp; Generation for Causal World Process Insights</title>
      <itunes:episode>1428</itunes:episode>
      <podcast:episode>1428</podcast:episode>
      <itunes:title>Envision: Benchmarking Unified Understanding &amp; Generation for Causal World Process Insights</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8275f35e-eaa6-4c93-b3bf-225dcaa823fc</guid>
      <link>https://share.transistor.fm/s/95d57af0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            Envision: Benchmarking Unified Understanding &amp; Generation for Causal World Process Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01816v1">http://arxiv.org/abs/2512.01816v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            Envision: Benchmarking Unified Understanding &amp; Generation for Causal World Process Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01816v1">http://arxiv.org/abs/2512.01816v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:14:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95d57af0/3c5579fa.mp3" length="23261212" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1450</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            Envision: Benchmarking Unified Understanding &amp; Generation for Causal World Process Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01816v1">http://arxiv.org/abs/2512.01816v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, while fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To transition evaluation from single images to sequential frames and assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physicality, and aesthetics. Comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified models) shows that specialized T2I models demonstrate proficiency in aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming specialized counterparts in causal narrative coherence. However, even these unified architectures remain subordinate to closed-source models and struggle to overcome the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling and ultimately limiting world knowledge internalization and generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</title>
      <itunes:episode>1427</itunes:episode>
      <podcast:episode>1427</podcast:episode>
      <itunes:title>Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">114ee797-2ece-4781-a9f6-82b9d49d6b0b</guid>
      <link>https://share.transistor.fm/s/4ea398b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01374v2">http://arxiv.org/abs/2512.01374v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01374v2">http://arxiv.org/abs/2512.01374v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.</p>
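
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of a token-level surrogate objective with importance-sampling correction and clipping, in the spirit of the stabilization techniques discussed above. Broadcasting the sequence-level reward to every token as the advantage, the clipping threshold, and the tensor shapes are assumptions for illustration only.</p>

            <pre><code>import torch

def clipped_token_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-level policy-gradient surrogate with IS correction and clipping.

    logp_new / logp_old: per-token log-probs under the current and behaviour
    policies, shape (batch, seq_len). advantages: per-token advantages; here
    the sequence-level reward is simply broadcast to every token.
    """
    ratio = torch.exp(logp_new - logp_old)                  # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (minimum) objective per token, negated to give a loss.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage with placeholder tensors standing in for model outputs.
B, T = 4, 16
logp_old = -torch.rand(B, T)                    # placeholder per-token log-probs
logp_new = logp_old + 0.01 * torch.randn(B, T)
seq_reward = torch.randn(B, 1)                  # one scalar reward per sequence
loss = clipped_token_surrogate(logp_new, logp_old, seq_reward.expand(B, T))
print(loss.item())</code></pre>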
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:13:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4ea398b9/aab40a6d.mp3" length="18962900" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1181</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Stabilizing Reinforcement Learning with LLMs: Formulation and Practices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01374v2">http://arxiv.org/abs/2512.01374v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How Far Are We from Genuinely Useful Deep Research Agents?</title>
      <itunes:episode>1426</itunes:episode>
      <podcast:episode>1426</podcast:episode>
      <itunes:title>How Far Are We from Genuinely Useful Deep Research Agents?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc8ad101-9147-4b1a-9621-306a17fcbb44</guid>
      <link>https://share.transistor.fm/s/469a8c8f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            How Far Are We from Genuinely Useful Deep Research Agents?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01948v1">http://arxiv.org/abs/2512.01948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics, which fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            How Far Are We from Genuinely Useful Deep Research Agents?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01948v1">http://arxiv.org/abs/2512.01948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics, which fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:13:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/469a8c8f/f621803c.mp3" length="23687498" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1477</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            How Far Are We from Genuinely Useful Deep Research Agents?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.01948v1">http://arxiv.org/abs/2512.01948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics, which fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards</title>
      <itunes:episode>1425</itunes:episode>
      <podcast:episode>1425</podcast:episode>
      <itunes:title>What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">223cdeb6-a31c-4f0d-b404-8f66545f0a7a</guid>
      <link>https://share.transistor.fm/s/8848d4b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras</p>

            <p><strong>Title:</strong><br>
            What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.00425v1">http://arxiv.org/abs/2512.00425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently, revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, in both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras</p>

            <p><strong>Title:</strong><br>
            What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.00425v1">http://arxiv.org/abs/2512.00425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently, revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, in both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.</p>
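
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of a verifiable constant-acceleration reward in the spirit of the kinematic constraint described above. It assumes the optical-flow velocity proxy has already been reduced to one scalar speed per frame; that reduction, the normalization, and the penalty form are illustrative choices, not the paper's exact formulation.</p>

            <pre><code>import numpy as np

def constant_acceleration_reward(per_frame_speed, eps=1e-8):
    """Score how well a speed-proxy trajectory follows constant acceleration.

    per_frame_speed: 1-D array, e.g. mean optical-flow magnitude per frame.
    With unit frame spacing, diff(speed) approximates acceleration; under
    constant-acceleration motion its frame-to-frame change (the discrete
    jerk) is zero, so we penalise the mean squared jerk.
    """
    speed = np.asarray(per_frame_speed, dtype=np.float64)
    accel = np.diff(speed)               # finite-difference acceleration
    jerk = np.diff(accel)                # deviation from constant acceleration
    return -float(np.mean(jerk ** 2) / (np.var(speed) + eps))

# Toy check: uniformly accelerated motion scores higher (closer to 0)
# than erratic motion.
t = np.arange(16, dtype=np.float64)
uniform = 0.5 * t                                    # linearly growing speed
erratic = np.random.default_rng(0).random(16)        # random speeds
print(constant_acceleration_reward(uniform), constant_acceleration_reward(erratic))</code></pre>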
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:13:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8848d4b9/ece4081c.mp3" length="21424282" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minh-Quan Le, Yuanzhi Zhu, Vicky Kalogeiton, Dimitris Samaras</p>

            <p><strong>Title:</strong><br>
            What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.00425v1">http://arxiv.org/abs/2512.00425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video diffusion models can synthesize visually compelling clips, yet often violate basic physical laws: objects float, accelerations drift, and collisions behave inconsistently, revealing a persistent gap between visual realism and physical realism. We propose $\texttt{NewtonRewards}$, the first physics-grounded post-training framework for video generation based on $\textit{verifiable rewards}$. Instead of relying on human or VLM feedback, $\texttt{NewtonRewards}$ extracts $\textit{measurable proxies}$ from generated videos using frozen utility models: optical flow serves as a proxy for velocity, while high-level appearance features serve as a proxy for mass. These proxies enable explicit enforcement of Newtonian structure through two complementary rewards: a Newtonian kinematic constraint enforcing constant-acceleration dynamics, and a mass conservation reward preventing trivial, degenerate solutions. We evaluate $\texttt{NewtonRewards}$ on five Newtonian Motion Primitives (free fall, horizontal/parabolic throw, and ramp sliding down/up) using our newly constructed large-scale benchmark, $\texttt{NewtonBench-60K}$. Across all primitives, in both visual and physics metrics, $\texttt{NewtonRewards}$ consistently improves physical plausibility, motion smoothness, and temporal coherence over prior post-training methods. It further maintains strong performance under out-of-distribution shifts in height, speed, and friction. Our results show that physics-grounded verifiable rewards offer a scalable path toward physics-aware video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout</title>
      <itunes:episode>1424</itunes:episode>
      <podcast:episode>1424</podcast:episode>
      <itunes:title>Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">30f47e52-c473-4911-b1e7-fbb3625723fa</guid>
      <link>https://share.transistor.fm/s/9188da16</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20649v1">http://arxiv.org/abs/2511.20649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20649v1">http://arxiv.org/abs/2511.20649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:12:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9188da16/46b32123.mp3" length="18442989" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1149</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20649v1">http://arxiv.org/abs/2511.20649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment</title>
      <itunes:episode>1423</itunes:episode>
      <podcast:episode>1423</podcast:episode>
      <itunes:title>The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b553425f-b7a6-47fa-85af-2108cec20ebc</guid>
      <link>https://share.transistor.fm/s/12e92e85</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20614v1">http://arxiv.org/abs/2511.20614v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20614v1">http://arxiv.org/abs/2511.20614v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:12:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/12e92e85/8ca8e010.mp3" length="23159668" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1444</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20614v1">http://arxiv.org/abs/2511.20614v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, we aim to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models</title>
      <itunes:episode>1422</itunes:episode>
      <podcast:episode>1422</podcast:episode>
      <itunes:title>TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e0b541cf-1ffb-4caf-9570-18fc3d7dec34</guid>
      <link>https://share.transistor.fm/s/7e9595b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong</p>

            <p><strong>Title:</strong><br>
            TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02014v1">http://arxiv.org/abs/2512.02014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong</p>

            <p><strong>Title:</strong><br>
            TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02014v1">http://arxiv.org/abs/2512.02014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:11:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e9595b8/6d6bc012.mp3" length="24020215" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1498</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong</p>

            <p><strong>Title:</strong><br>
            TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2512.02014v1">http://arxiv.org/abs/2512.02014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LFM2 Technical Report</title>
      <itunes:episode>1421</itunes:episode>
      <podcast:episode>1421</podcast:episode>
      <itunes:title>LFM2 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2fc047fe-b9d2-45a0-9382-25651780aaaa</guid>
      <link>https://share.transistor.fm/s/67c03d7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T. H. Smith, Aditya Tadimeti, Neehal Tumma</p>

            <p><strong>Title:</strong><br>
            LFM2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23404v1">http://arxiv.org/abs/2511.23404v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T. H. Smith, Aditya Tadimeti, Neehal Tumma</p>

            <p><strong>Title:</strong><br>
            LFM2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23404v1">http://arxiv.org/abs/2511.23404v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Dec 2025 20:11:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/67c03d7a/164534a4.mp3" length="21777805" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1357</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T. H. Smith, Aditya Tadimeti, Neehal Tumma</p>

            <p><strong>Title:</strong><br>
            LFM2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23404v1">http://arxiv.org/abs/2511.23404v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</title>
      <itunes:episode>1420</itunes:episode>
      <podcast:episode>1420</podcast:episode>
      <itunes:title>Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a5f5a60-f7da-4f99-9280-2dd38bcf0573</guid>
      <link>https://share.transistor.fm/s/d9b71982</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou</p>

            <p><strong>Title:</strong><br>
            Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22699v1">http://arxiv.org/abs/2511.22699v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (&lt;16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou</p>

            <p><strong>Title:</strong><br>
            Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22699v1">http://arxiv.org/abs/2511.22699v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (&lt;16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Dec 2025 19:30:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d9b71982/7d435c43.mp3" length="23785338" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1483</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou</p>

            <p><strong>Title:</strong><br>
            Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22699v1">http://arxiv.org/abs/2511.22699v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (&lt;16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REASONEDIT: Towards Reasoning-Enhanced Image Editing Models</title>
      <itunes:episode>1419</itunes:episode>
      <podcast:episode>1419</podcast:episode>
      <itunes:title>REASONEDIT: Towards Reasoning-Enhanced Image Editing Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc3e1a99-3289-41a9-8f96-3ea9e1ec0c97</guid>
      <link>https://share.transistor.fm/s/9d1caa62</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu</p>

            <p><strong>Title:</strong><br>
            REASONEDIT: Towards Reasoning-Enhanced Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22625v1">http://arxiv.org/abs/2511.22625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu</p>

            <p><strong>Title:</strong><br>
            REASONEDIT: Towards Reasoning-Enhanced Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22625v1">http://arxiv.org/abs/2511.22625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Dec 2025 19:30:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9d1caa62/71dfd702.mp3" length="20862095" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu</p>

            <p><strong>Title:</strong><br>
            REASONEDIT: Towards Reasoning-Enhanced Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22625v1">http://arxiv.org/abs/2511.22625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of the MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of the MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements on ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vision Bridge Transformer at Scale</title>
      <itunes:episode>1418</itunes:episode>
      <podcast:episode>1418</podcast:episode>
      <itunes:title>Vision Bridge Transformer at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">849ca3c5-7bbf-4e3a-b2a1-4de40e60d8a1</guid>
      <link>https://share.transistor.fm/s/becf8071</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Vision Bridge Transformer at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23199v1">http://arxiv.org/abs/2511.23199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Vision Bridge Transformer at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23199v1">http://arxiv.org/abs/2511.23199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Dec 2025 19:29:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/becf8071/7b804dd2.mp3" length="20990801" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Vision Bridge Transformer at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.23199v1">http://arxiv.org/abs/2511.23199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning</title>
      <itunes:episode>1417</itunes:episode>
      <podcast:episode>1417</podcast:episode>
      <itunes:title>DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3419ee4-6855-4d58-85d9-0105d72a98f4</guid>
      <link>https://share.transistor.fm/s/d3250ac4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22570v1">http://arxiv.org/abs/2511.22570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22570v1">http://arxiv.org/abs/2511.22570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Dec 2025 19:29:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3250ac4/6310bea3.mp3" length="20375594" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihong Shao, Yuxiang Luo, Chengda Lu, Z. Z. Ren, Jiewen Hu, Tian Ye, Zhibin Gou, Shirong Ma, Xiaokang Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22570v1">http://arxiv.org/abs/2511.22570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have made significant progress in mathematical reasoning, which serves as an important testbed for AI and could impact scientific research if further advanced. By scaling reasoning with reinforcement learning that rewards correct final answers, LLMs have improved from poor performance to saturating quantitative reasoning competitions like AIME and HMMT in one year. However, this approach faces fundamental limitations. Pursuing higher final answer accuracy doesn't address a key issue: correct answers don't guarantee correct reasoning. Moreover, many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable. To push the limits of deep reasoning, we believe it is necessary to verify the comprehensiveness and rigor of mathematical reasoning. Self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions. Towards self-verifiable mathematical reasoning, we investigate how to train an accurate and faithful LLM-based verifier for theorem proving. We then train a proof generator using the verifier as the reward model, and incentivize the generator to identify and resolve as many issues as possible in their own proofs before finalizing them. To maintain the generation-verification gap as the generator becomes stronger, we propose to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier. Our resulting model, DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled test-time compute.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Architecture Decoupling Is Not All You Need For Unified Multimodal Model</title>
      <itunes:episode>1416</itunes:episode>
      <podcast:episode>1416</podcast:episode>
      <itunes:title>Architecture Decoupling Is Not All You Need For Unified Multimodal Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">87201bf0-df8e-4658-9df7-ae7874e4b959</guid>
      <link>https://share.transistor.fm/s/39158c3d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            Architecture Decoupling Is Not All You Need For Unified Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22663v1">http://arxiv.org/abs/2511.22663v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            Architecture Decoupling Is Not All You Need For Unified Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22663v1">http://arxiv.org/abs/2511.22663v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Dec 2025 19:29:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39158c3d/cd376683.mp3" length="20141128" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, Peng Pei, Xunliang Cai, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            Architecture Decoupling Is Not All You Need For Unified Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.22663v1">http://arxiv.org/abs/2511.22663v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of model decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling alleviates conflicts by studying the cross-modal attention behavior of models. We observe that model decoupling essentially drives models toward task-specific multimodal interaction patterns, as seen in Qwen-VL and HunyuanImage, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Evaluation of Russian-language Architectures</title>
      <itunes:episode>1415</itunes:episode>
      <podcast:episode>1415</podcast:episode>
      <itunes:title>Multimodal Evaluation of Russian-language Architectures</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2d96cdc6-a8d3-4e43-bbb9-197689499544</guid>
      <link>https://share.transistor.fm/s/3e702861</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova</p>

            <p><strong>Title:</strong><br>
            Multimodal Evaluation of Russian-language Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15552v2">http://arxiv.org/abs/2511.15552v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova</p>

            <p><strong>Title:</strong><br>
            Multimodal Evaluation of Russian-language Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15552v2">http://arxiv.org/abs/2511.15552v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Nov 2025 18:53:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e702861/2c5ec046.mp3" length="23871397" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1488</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov, Artem Safin, Maria Tikhonova, Alexander Kharitonov, Yulia Lyakh, Petr Surovtsev, Denis Shevelev, Vildan Saburov, Vasily Konovalov, Elisei Rykov, Ivan Sviridov, Amina Miftakhova, Ilseyar Alimova, Alexander Panchenko, Alexander Kapitanov, Alena Fenogenova</p>

            <p><strong>Title:</strong><br>
            Multimodal Evaluation of Russian-language Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15552v2">http://arxiv.org/abs/2511.15552v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Latent Collaboration in Multi-Agent Systems</title>
      <itunes:episode>1414</itunes:episode>
      <podcast:episode>1414</podcast:episode>
      <itunes:title>Latent Collaboration in Multi-Agent Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">34fe5d2d-466f-46f0-94e2-266b2b4cc75a</guid>
      <link>https://share.transistor.fm/s/8e10e8e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            Latent Collaboration in Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20639v1">http://arxiv.org/abs/2511.20639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through its last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            Latent Collaboration in Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20639v1">http://arxiv.org/abs/2511.20639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through its last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Nov 2025 18:53:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e10e8e2/facdede4.mp3" length="25213034" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1572</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang</p>

            <p><strong>Title:</strong><br>
            Latent Collaboration in Multi-Agent Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20639v1">http://arxiv.org/abs/2511.20639v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through its last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation</title>
      <itunes:episode>1413</itunes:episode>
      <podcast:episode>1413</podcast:episode>
      <itunes:title>Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">188c2bfb-bfd4-45ed-8a5a-ae1768a67f30</guid>
      <link>https://share.transistor.fm/s/25da6430</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang</p>

            <p><strong>Title:</strong><br>
            Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20714v1">http://arxiv.org/abs/2511.20714v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens block by block, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang</p>

            <p><strong>Title:</strong><br>
            Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20714v1">http://arxiv.org/abs/2511.20714v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens block by block, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Nov 2025 18:52:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/25da6430/fe913c8c.mp3" length="17762118" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1106</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang</p>

            <p><strong>Title:</strong><br>
            Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20714v1">http://arxiv.org/abs/2511.20714v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens block by block, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms</title>
      <itunes:episode>1412</itunes:episode>
      <podcast:episode>1412</podcast:episode>
      <itunes:title>GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">33428a85-f0ca-4601-9264-62a7ecbf3355</guid>
      <link>https://share.transistor.fm/s/fb00092f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.NE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Valentin Khrulkov, Andrey Galichin, Denis Bashkirov, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.17592v1">http://arxiv.org/abs/2511.17592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve (Novikov et al., 2025; Georgiev et al., 2025), have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. However, the high-level descriptions in published work leave many implementation details unspecified, hindering reproducibility and further research. In this report we present GigaEvo, an extensible open-source framework that enables researchers to study and experiment with hybrid LLM-evolution approaches inspired by AlphaEvolve. Our system provides modular implementations of key components: MAP-Elites quality-diversity algorithms, asynchronous DAG-based evaluation pipelines, LLM-driven mutation operators with insight generation and bidirectional lineage tracking, and flexible multi-island evolutionary strategies. In order to assess reproducibility and validate our implementation we evaluate GigaEvo on challenging problems from the AlphaEvolve paper: Heilbronn triangle placement, circle packing in squares, and high-dimensional kissing numbers. The framework emphasizes modularity, concurrency, and ease of experimentation, enabling rapid prototyping through declarative configuration. We provide detailed descriptions of system architecture, implementation decisions, and experimental methodology to support further research in LLM driven evolutionary methods. The GigaEvo framework and all experimental code are available at https://github.com/AIRI-Institute/gigaevo-core.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.NE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Valentin Khrulkov, Andrey Galichin, Denis Bashkirov, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.17592v1">http://arxiv.org/abs/2511.17592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve (Novikov et al., 2025; Georgiev et al., 2025), have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. However, the high-level descriptions in published work leave many implementation details unspecified, hindering reproducibility and further research. In this report we present GigaEvo, an extensible open-source framework that enables researchers to study and experiment with hybrid LLM-evolution approaches inspired by AlphaEvolve. Our system provides modular implementations of key components: MAP-Elites quality-diversity algorithms, asynchronous DAG-based evaluation pipelines, LLM-driven mutation operators with insight generation and bidirectional lineage tracking, and flexible multi-island evolutionary strategies. In order to assess reproducibility and validate our implementation we evaluate GigaEvo on challenging problems from the AlphaEvolve paper: Heilbronn triangle placement, circle packing in squares, and high-dimensional kissing numbers. The framework emphasizes modularity, concurrency, and ease of experimentation, enabling rapid prototyping through declarative configuration. We provide detailed descriptions of system architecture, implementation decisions, and experimental methodology to support further research in LLM driven evolutionary methods. The GigaEvo framework and all experimental code are available at https://github.com/AIRI-Institute/gigaevo-core.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:43:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb00092f/420fccce.mp3" length="22781391" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.NE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Valentin Khrulkov, Andrey Galichin, Denis Bashkirov, Dmitry Vinichenko, Oleg Travkin, Roman Alferov, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.17592v1">http://arxiv.org/abs/2511.17592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve (Novikov et al., 2025; Georgiev et al., 2025), have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. However, the high-level descriptions in published work leave many implementation details unspecified, hindering reproducibility and further research. In this report we present GigaEvo, an extensible open-source framework that enables researchers to study and experiment with hybrid LLM-evolution approaches inspired by AlphaEvolve. Our system provides modular implementations of key components: MAP-Elites quality-diversity algorithms, asynchronous DAG-based evaluation pipelines, LLM-driven mutation operators with insight generation and bidirectional lineage tracking, and flexible multi-island evolutionary strategies. In order to assess reproducibility and validate our implementation we evaluate GigaEvo on challenging problems from the AlphaEvolve paper: Heilbronn triangle placement, circle packing in squares, and high-dimensional kissing numbers. The framework emphasizes modularity, concurrency, and ease of experimentation, enabling rapid prototyping through declarative configuration. We provide detailed descriptions of system architecture, implementation decisions, and experimental methodology to support further research in LLM driven evolutionary methods. The GigaEvo framework and all experimental code are available at https://github.com/AIRI-Institute/gigaevo-core.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MedSAM3: Delving into Segment Anything with Medical Concepts</title>
      <itunes:episode>1411</itunes:episode>
      <podcast:episode>1411</podcast:episode>
      <itunes:title>MedSAM3: Delving into Segment Anything with Medical Concepts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f50f1027-89f2-4c8e-bfdc-c5e56af284f2</guid>
      <link>https://share.transistor.fm/s/50ee192f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, Jintai Chen</p>

            <p><strong>Title:</strong><br>
            MedSAM3: Delving into Segment Anything with Medical Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19046v1">http://arxiv.org/abs/2511.19046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, Jintai Chen</p>

            <p><strong>Title:</strong><br>
            MedSAM3: Delving into Segment Anything with Medical Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19046v1">http://arxiv.org/abs/2511.19046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:42:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/50ee192f/e3b0a3db.mp3" length="23441322" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1461</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, Jintai Chen</p>

            <p><strong>Title:</strong><br>
            MedSAM3: Delving into Segment Anything with Medical Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19046v1">http://arxiv.org/abs/2511.19046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical applications. Here, we propose MedSAM-3, a text-promptable segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning</title>
      <itunes:episode>1410</itunes:episode>
      <podcast:episode>1410</podcast:episode>
      <itunes:title>Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b525de8a-c9fd-4e7f-84b7-ea7bd932b952</guid>
      <link>https://share.transistor.fm/s/0cbb755d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19900v1">http://arxiv.org/abs/2511.19900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19900v1">http://arxiv.org/abs/2511.19900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:42:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0cbb755d/768f696d.mp3" length="21848505" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao</p>

            <p><strong>Title:</strong><br>
            Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19900v1">http://arxiv.org/abs/2511.19900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation</title>
      <itunes:episode>1409</itunes:episode>
      <podcast:episode>1409</podcast:episode>
      <itunes:title>SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d50c3e69-cf1f-4287-b30b-3ee67ebff3a9</guid>
      <link>https://share.transistor.fm/s/c0e86ce2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma</p>

            <p><strong>Title:</strong><br>
            SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19320v1">http://arxiv.org/abs/2511.19320v1</a></p>

            <p><strong>Abstract:</strong><br>
            Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma</p>

            <p><strong>Title:</strong><br>
            SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19320v1">http://arxiv.org/abs/2511.19320v1</a></p>

            <p><strong>Abstract:</strong><br>
            Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:41:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c0e86ce2/59371edf.mp3" length="18470980" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1151</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma</p>

            <p><strong>Title:</strong><br>
            SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19320v1">http://arxiv.org/abs/2511.19320v1</a></p>

            <p><strong>Abstract:</strong><br>
            Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation</title>
      <itunes:episode>1408</itunes:episode>
      <podcast:episode>1408</podcast:episode>
      <itunes:title>iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fbc5020e-09f1-4a81-9e7d-e5f0cf908aa0</guid>
      <link>https://share.transistor.fm/s/14a90c99</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin</p>

            <p><strong>Title:</strong><br>
            iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20635v1">http://arxiv.org/abs/2511.20635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin</p>

            <p><strong>Title:</strong><br>
            iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20635v1">http://arxiv.org/abs/2511.20635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:41:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/14a90c99/6a7d7bed.mp3" length="24140582" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1505</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin</p>

            <p><strong>Title:</strong><br>
            iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20635v1">http://arxiv.org/abs/2511.20635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward</title>
      <itunes:episode>1407</itunes:episode>
      <podcast:episode>1407</podcast:episode>
      <itunes:title>Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76d56db3-24ba-46da-bd14-e6e93112dd01</guid>
      <link>https://share.transistor.fm/s/45ebd5e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20561v1">http://arxiv.org/abs/2511.20561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20561v1">http://arxiv.org/abs/2511.20561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:41:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/45ebd5e2/74b1fd23.mp3" length="24213328" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1510</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20561v1">http://arxiv.org/abs/2511.20561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at https://github.com/PKU-YuanGroup/UniSandBox</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GigaWorld-0: World Models as Data Engine to Empower Embodied AI</title>
      <itunes:episode>1406</itunes:episode>
      <podcast:episode>1406</podcast:episode>
      <itunes:title>GigaWorld-0: World Models as Data Engine to Empower Embodied AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">af4ba8ca-dab0-45fa-8308-75d827f73883</guid>
      <link>https://share.transistor.fm/s/64baaf99</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaWorld-0: World Models as Data Engine to Empower Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19861v1">http://arxiv.org/abs/2511.19861v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaWorld-0: World Models as Data Engine to Empower Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19861v1">http://arxiv.org/abs/2511.19861v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:40:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/64baaf99/84279d3f.mp3" length="21603558" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaWorld-0: World Models as Data Engine to Empower Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19861v1">http://arxiv.org/abs/2511.19861v1</a></p>

            <p><strong>Abstract:</strong><br>
            World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space</title>
      <itunes:episode>1405</itunes:episode>
      <podcast:episode>1405</podcast:episode>
      <itunes:title>SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4a784993-27a3-4a30-9cd0-c7ef932ddbd1</guid>
      <link>https://share.transistor.fm/s/806e8de4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20102v1">http://arxiv.org/abs/2511.20102v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20102v1">http://arxiv.org/abs/2511.20102v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.</p>
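
            <p>As a rough illustration of the alignment idea (not the authors' code), the sketch below computes a full-attention and a top-k sparse-attention output for the same layer and adds an MSE penalty between them; the top-k sparsification and the MSE term are assumed simplifications of SSA's bidirectional alignment.</p>

            <pre><code># A minimal sketch, assuming a toy top-k sparse pattern and an MSE alignment
# term; SSA's actual sparse pattern and alignment objective may differ.
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def sparse_attention(q, k, v, keep=4):
    # Each query attends only to its top-`keep` keys; all other scores get -inf.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    topk = scores.topk(keep, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).scatter(-1, topk, 0.0)
    return torch.softmax(scores + mask, dim=-1) @ v

q, k, v = (torch.randn(1, 16, 32) for _ in range(3))
out_full = full_attention(q, k, v)
out_sparse = sparse_attention(q, k, v)
# Layer-wise alignment: gradients still flow to every token through the full
# path, while the sparse output is pulled toward its full-attention counterpart.
align_loss = F.mse_loss(out_sparse, out_full)
</code></pre>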
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:40:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/806e8de4/a9e8caa9.mp3" length="20731723" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun</p>

            <p><strong>Title:</strong><br>
            SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20102v1">http://arxiv.org/abs/2511.20102v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Soft Adaptive Policy Optimization</title>
      <itunes:episode>1404</itunes:episode>
      <podcast:episode>1404</podcast:episode>
      <itunes:title>Soft Adaptive Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58d00b13-4a5f-4587-bcfa-eecfa08028cb</guid>
      <link>https://share.transistor.fm/s/7ebed5fd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Soft Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20347v1">http://arxiv.org/abs/2511.20347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Soft Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20347v1">http://arxiv.org/abs/2511.20347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.</p>
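
            <p>The abstract does not spell out SAPO's gate, so the sketch below uses an assumed Gaussian-shaped soft gate (the name <code>soft_gate_weight</code> and the temperature form are hypothetical) purely to contrast smooth, temperature-controlled attenuation of token-level importance ratios with PPO/GRPO-style hard clipping.</p>

            <pre><code># A minimal sketch, not SAPO's actual gate: soft_gate_weight is an assumed
# Gaussian-shaped gate with temperature tau, shown next to hard clipping.
import torch

def hard_clip_weight(ratio, eps=0.2):
    # GRPO/PPO-style hard clipping: ratios outside [1-eps, 1+eps] are pinned,
    # so strongly off-policy tokens stop contributing a useful gradient.
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

def soft_gate_weight(ratio, tau=0.1):
    # Hypothetical smooth gate: the weight decays continuously as a token's
    # importance ratio drifts away from 1, with tau controlling the sharpness.
    return ratio * torch.exp(-((ratio - 1.0) ** 2) / tau)

ratios = torch.tensor([0.7, 0.95, 1.0, 1.3, 2.5])  # token-level importance ratios
print(hard_clip_weight(ratios))  # off-policy tokens sit at the clip boundary
print(soft_gate_weight(ratios))  # off-policy tokens are smoothly attenuated instead
</code></pre>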
            ]]>
      </content:encoded>
      <pubDate>Wed, 26 Nov 2025 19:40:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ebed5fd/aac75cf7.mp3" length="23228554" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1448</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Soft Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.20347v1">http://arxiv.org/abs/2511.20347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>General Agentic Memory Via Deep Research</title>
      <itunes:episode>1403</itunes:episode>
      <podcast:episode>1403</podcast:episode>
      <itunes:title>General Agentic Memory Via Deep Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f848b294-f358-44fe-86e2-44b85b6d9ce6</guid>
      <link>https://share.transistor.fm/s/59816d28</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 121 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            General Agentic Memory Via Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18423v1">http://arxiv.org/abs/2511.18423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory is critical for AI agents, yet the widely adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called <strong>general agentic memory (GAM)</strong>. GAM follows the principle of "<strong>just-in-time (JIT) compilation</strong>", where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a dual design with the following components. 1) <strong>Memorizer</strong>, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) <strong>Researcher</strong>, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvements over existing memory systems on various memory-grounded task completion scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 121 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            General Agentic Memory Via Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18423v1">http://arxiv.org/abs/2511.18423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory is critical for AI agents, yet the widely adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called <strong>general agentic memory (GAM)</strong>. GAM follows the principle of "<strong>just-in-time (JIT) compilation</strong>", where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a dual design with the following components. 1) <strong>Memorizer</strong>, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) <strong>Researcher</strong>, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvements over existing memory systems on various memory-grounded task completion scenarios.</p>
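
            <p>A schematic and heavily simplified reading of the Memorizer/Researcher split is sketched below; the keyword-overlap retrieval and the 80-character highlights are illustrative assumptions, since the real components are LLM agents rather than keyword matchers.</p>

            <pre><code># A schematic sketch of the dual design, with invented storage and retrieval.
class Memorizer:
    def __init__(self):
        self.page_store = {}   # complete history, keyed by page id
        self.memory = []       # lightweight per-page highlights kept offline

    def write(self, page_id, text):
        self.page_store[page_id] = text
        self.memory.append((page_id, text[:80]))  # only a short hint is memorized

class Researcher:
    def __init__(self, memorizer):
        self.m = memorizer

    def research(self, query):
        # At request time, use the lightweight memory to pick candidate pages,
        # then pull full pages from the store to build the context just in time.
        words = set(query.lower().split())
        hits = [pid for pid, hint in self.m.memory if words & set(hint.lower().split())]
        return "\n".join(self.m.page_store[pid] for pid in hits)

mem = Memorizer()
mem.write("p1", "User prefers concise answers and works in UTC+8.")
mem.write("p2", "Project deadline moved to Friday; stakeholders notified.")
print(Researcher(mem).research("which timezone does the user work in"))
</code></pre>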
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:29:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/59816d28/25cd7f1f.mp3" length="24642099" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1536</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 121 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            General Agentic Memory Via Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18423v1">http://arxiv.org/abs/2511.18423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Memory is critical for AI agents, yet the widely adopted static memory, aiming to create readily available memory in advance, is inevitably subject to severe information loss. To address this limitation, we propose a novel framework called <strong>general agentic memory (GAM)</strong>. GAM follows the principle of "<strong>just-in-time (JIT) compilation</strong>", where it focuses on creating optimized contexts for its client at runtime while keeping only simple but useful memory during the offline stage. To this end, GAM employs a dual design with the following components. 1) <strong>Memorizer</strong>, which highlights key historical information using a lightweight memory, while maintaining complete historical information within a universal page-store. 2) <strong>Researcher</strong>, which retrieves and integrates useful information from the page-store for its online request, guided by the pre-constructed memory. This design allows GAM to effectively leverage the agentic capabilities and test-time scalability of frontier large language models (LLMs), while also facilitating end-to-end performance optimization through reinforcement learning. In our experimental study, we demonstrate that GAM achieves substantial improvements over existing memory systems on various memory-grounded task completion scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning</title>
      <itunes:episode>1402</itunes:episode>
      <podcast:episode>1402</podcast:episode>
      <itunes:title>AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c06c350b-e94f-4435-a747-549745170c0a</guid>
      <link>https://share.transistor.fm/s/eaab89c5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo</p>

            <p><strong>Title:</strong><br>
            AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19304v1">http://arxiv.org/abs/2511.19304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo</p>

            <p><strong>Title:</strong><br>
            AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19304v1">http://arxiv.org/abs/2511.19304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.</p>
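
            <p>To make the "environments as factorizable distributions" framing concrete, the toy sketch below assembles an environment by independently sampling a transition rule, an observation rule, and a reward rule; the specific rule tables are invented for illustration, since AutoEnv generates its environments with language models rather than from fixed tables.</p>

            <pre><code># A toy sketch, assuming three hand-written rule families per factor.
import random

TRANSITIONS = {
    "walk":  lambda s, a: s + a,                            # deterministic move on a line
    "noisy": lambda s, a: s + a + random.choice([-1, 0, 1]),
}
OBSERVATIONS = {
    "exact":  lambda s: s,
    "coarse": lambda s: s // 2,                             # agent only sees a bucketed state
}
REWARDS = {
    "goal":  lambda s: 1.0 if s >= 5 else 0.0,
    "dense": lambda s: -abs(5 - s) / 5.0,
}

def sample_environment(rng):
    # An environment is one draw from each factor of the distribution.
    return (rng.choice(list(TRANSITIONS)),
            rng.choice(list(OBSERVATIONS)),
            rng.choice(list(REWARDS)))

rng = random.Random(0)
t, o, r = sample_environment(rng)
state = 0
for action in [1, 1, 2, 1]:
    state = TRANSITIONS[t](state, action)
    print(OBSERVATIONS[o](state), REWARDS[r](state))
</code></pre>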
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:29:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eaab89c5/ed26fb2b.mp3" length="22139397" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo</p>

            <p><strong>Title:</strong><br>
            AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19304v1">http://arxiv.org/abs/2511.19304v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Computer-Use Agents as Judges for Generative User Interface</title>
      <itunes:episode>1401</itunes:episode>
      <podcast:episode>1401</podcast:episode>
      <itunes:title>Computer-Use Agents as Judges for Generative User Interface</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6fbfa7d8-6b35-410e-aa1d-c3f3e62e32eb</guid>
      <link>https://share.transistor.fm/s/27529f29</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Computer-Use Agents as Judges for Generative User Interface</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15567v1">http://arxiv.org/abs/2511.15567v1</a></p>

            <p><strong>Abstract:</strong><br>
            Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUAs act as judges to assist the Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Computer-Use Agents as Judges for Generative User Interface</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15567v1">http://arxiv.org/abs/2511.15567v1</a></p>

            <p><strong>Abstract:</strong><br>
            Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUAs act as judges to assist the Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.</p>
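
            <p>One minimal way to picture the Coder-as-Designer / CUA-as-Judge loop is sketched below; <code>generate_site</code>, <code>cua_attempt</code>, and <code>revise_site</code> are hypothetical stubs, and only the control flow mirrors the abstract.</p>

            <pre><code># Illustrative control flow only; the real Coder and CUA are language-model agents.
def generate_site(spec):
    return {"spec": spec, "buttons": ["home"]}

def cua_attempt(site, task):
    # Stub judge: a task succeeds only if the UI exposes the control it needs.
    return {"success": task in site["buttons"], "trace": f"looked for '{task}' button"}

def revise_site(site, feedback):
    # Stub designer: add whatever controls the failed traces asked for.
    missing = [f.split("'")[1] for f in feedback]
    return {**site, "buttons": site["buttons"] + missing}

def coder_cua_loop(spec, tasks, rounds=3):
    site = generate_site(spec)
    for _ in range(rounds):
        results = [cua_attempt(site, t) for t in tasks]      # CUA tries each task
        if all(r["success"] for r in results):
            break
        site = revise_site(site, [r["trace"] for r in results if not r["success"]])
    return site

print(coder_cua_loop("todo app", ["home", "add task", "delete task"]))
</code></pre>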
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:29:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/27529f29/0f54daad.mp3" length="24864472" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1550</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Computer-Use Agents as Judges for Generative User Interface</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15567v1">http://arxiv.org/abs/2511.15567v1</a></p>

            <p><strong>Abstract:</strong><br>
            Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUAs act as judges to assist the Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation</title>
      <itunes:episode>1400</itunes:episode>
      <podcast:episode>1400</podcast:episode>
      <itunes:title>DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e51695e5-4a73-4e8c-a467-d7624f45def6</guid>
      <link>https://share.transistor.fm/s/08734c7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian</p>

            <p><strong>Title:</strong><br>
            DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19365v1">http://arxiv.org/abs/2511.19365v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. To decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian</p>

            <p><strong>Title:</strong><br>
            DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19365v1">http://arxiv.org/abs/2511.19365v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition of decoupling the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining an FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:28:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/08734c7a/0b200988.mp3" length="24213305" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1510</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian</p>

            <p><strong>Title:</strong><br>
            DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19365v1">http://arxiv.org/abs/2511.19365v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition of decoupling the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining an FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</title>
      <itunes:episode>1399</itunes:episode>
      <podcast:episode>1399</podcast:episode>
      <itunes:title>DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">61a1a0d5-9420-4f27-90a3-ad4c44c0102d</guid>
      <link>https://share.transistor.fm/s/b0adafaf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh</p>

            <p><strong>Title:</strong><br>
            DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19399v1">http://arxiv.org/abs/2511.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh</p>

            <p><strong>Title:</strong><br>
            DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19399v1">http://arxiv.org/abs/2511.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:28:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b0adafaf/e5ef0133.mp3" length="19751589" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1231</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh</p>

            <p><strong>Title:</strong><br>
            DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19399v1">http://arxiv.org/abs/2511.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios</title>
      <itunes:episode>1398</itunes:episode>
      <podcast:episode>1398</podcast:episode>
      <itunes:title>UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f1cbab9d-9639-4a7c-bd40-8931977f9f67</guid>
      <link>https://share.transistor.fm/s/1b1c0b51</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tian Ye, Song Fei, Lei Zhu</p>

            <p><strong>Title:</strong><br>
            UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18050v1">http://arxiv.org/abs/2511.18050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tian Ye, Song Fei, Lei Zhu</p>

            <p><strong>Title:</strong><br>
            UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18050v1">http://arxiv.org/abs/2511.18050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:27:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b1c0b51/cf52f57e.mp3" length="21497028" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tian Ye, Song Fei, Lei Zhu</p>

            <p><strong>Title:</strong><br>
            UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.18050v1">http://arxiv.org/abs/2511.18050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>In-Video Instructions: Visual Signals as Generative Control</title>
      <itunes:episode>1397</itunes:episode>
      <podcast:episode>1397</podcast:episode>
      <itunes:title>In-Video Instructions: Visual Signals as Generative Control</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">410e6d6c-3ebe-4e60-841f-3466cc474bf9</guid>
      <link>https://share.transistor.fm/s/b29acd66</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            In-Video Instructions: Visual Signals as Generative Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19401v1">http://arxiv.org/abs/2511.19401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            In-Video Instructions: Visual Signals as Generative Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19401v1">http://arxiv.org/abs/2511.19401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 25 Nov 2025 19:27:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b29acd66/44ccca54.mp3" length="21596449" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1346</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            In-Video Instructions: Visual Signals as Generative Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.19401v1">http://arxiv.org/abs/2511.19401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe</title>
      <itunes:episode>1396</itunes:episode>
      <podcast:episode>1396</podcast:episode>
      <itunes:title>OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1be999e7-be7a-4c6a-98a1-7c8fa6099fca</guid>
      <link>https://share.transistor.fm/s/c6956fd6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16334v1">http://arxiv.org/abs/2511.16334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16334v1">http://arxiv.org/abs/2511.16334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 24 Nov 2025 19:01:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c6956fd6/86aef443.mp3" length="20608429" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1284</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Keming Wu, Zuhao Yang, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16334v1">http://arxiv.org/abs/2511.16334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-source all our code, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story</title>
      <itunes:episode>1395</itunes:episode>
      <podcast:episode>1395</podcast:episode>
      <itunes:title>Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9884d89-3453-4f34-8093-0221e5aff870</guid>
      <link>https://share.transistor.fm/s/bf95774d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya</p>

            <p><strong>Title:</strong><br>
            Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15210v1">http://arxiv.org/abs/2511.15210v1</a></p>

            <p><strong>Abstract:</strong><br>
            Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya</p>

            <p><strong>Title:</strong><br>
            Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15210v1">http://arxiv.org/abs/2511.15210v1</a></p>

            <p><strong>Abstract:</strong><br>
            Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 24 Nov 2025 19:00:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bf95774d/28ada809.mp3" length="21394596" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 72 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya</p>

            <p><strong>Title:</strong><br>
            Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15210v1">http://arxiv.org/abs/2511.15210v1</a></p>

            <p><strong>Abstract:</strong><br>
            Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization</title>
      <itunes:episode>1394</itunes:episode>
      <podcast:episode>1394</podcast:episode>
      <itunes:title>GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">65b7eadd-92d8-4819-b281-5b5979b6702c</guid>
      <link>https://share.transistor.fm/s/cdb52ecd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15705v1">http://arxiv.org/abs/2511.15705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15705v1">http://arxiv.org/abs/2511.15705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 24 Nov 2025 19:00:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cdb52ecd/a36c1be4.mp3" length="20863776" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15705v1">http://arxiv.org/abs/2511.15705v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista greatly surpasses other open-source agentic models on the geolocalization task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAM 3: Segment Anything with Concepts</title>
      <itunes:episode>1393</itunes:episode>
      <podcast:episode>1393</podcast:episode>
      <itunes:title>SAM 3: Segment Anything with Concepts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ca71102-4b4f-4671-a0c5-99d19e48eaa5</guid>
      <link>https://share.transistor.fm/s/1971e106</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer</p>

            <p><strong>Title:</strong><br>
            SAM 3: Segment Anything with Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16719v1">http://arxiv.org/abs/2511.16719v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer</p>

            <p><strong>Title:</strong><br>
            SAM 3: Segment Anything with Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16719v1">http://arxiv.org/abs/2511.16719v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 24 Nov 2025 18:59:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1971e106/aefbb880.mp3" length="22989067" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding, Sagar Vaze, Francois Porcher, Feng Li, Siyuan Li, Aishwarya Kamath, Ho Kei Cheng, Piotr Dollár, Nikhila Ravi, Kate Saenko, Pengchuan Zhang, Christoph Feichtenhofer</p>

            <p><strong>Title:</strong><br>
            SAM 3: Segment Anything with Concepts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.16719v1">http://arxiv.org/abs/2511.16719v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks</title>
      <itunes:episode>1392</itunes:episode>
      <podcast:episode>1392</podcast:episode>
      <itunes:title>Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">97d540a4-e01f-4561-b7e7-f25ed91ce996</guid>
      <link>https://share.transistor.fm/s/f9fe9de4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15065v1">http://arxiv.org/abs/2511.15065v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15065v1">http://arxiv.org/abs/2511.15065v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10-20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Nov 2025 19:11:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f9fe9de4/8d631f9a.mp3" length="25006624" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1559</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Haiyuan Wan, Yiran Peng, Xin Cheng, Zhaoyang Yu, Jiayi Zhang, Junchi Yu, Xinlei Yu, Xiawu Zheng, Dongzhan Zhou, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15065v1">http://arxiv.org/abs/2511.15065v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis demonstrates that SFT can efficiently elicit the reasoning ability of video models. Video models exhibit stronger spatial perception during reasoning, outperforming leading VLMs and generalizing well across diverse scenarios, tasks, and levels of complexity. We further discover a test-time scaling effect, where diverse sampling during inference improves reasoning reliability by 10--20%. These findings highlight the unique potential and scalability of reasoning via video for spatial reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation</title>
      <itunes:episode>1391</itunes:episode>
      <podcast:episode>1391</podcast:episode>
      <itunes:title>Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">afcd3159-f9cc-4d9b-a22c-3076a317e8c0</guid>
      <link>https://share.transistor.fm/s/9b1d2a74</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 128 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14993v1">http://arxiv.org/abs/2511.14993v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a line-up of fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 128 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14993v1">http://arxiv.org/abs/2511.14993v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a line-up of fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Nov 2025 19:09:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9b1d2a74/5e8122fb.mp3" length="24004328" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1497</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 128 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Maria Kovaleva, Nikolai Vaulin, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Ilya Vasiliev, Julia Agafonova, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14993v1">http://arxiv.org/abs/2511.14993v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite - a line-up of 6B parameter image generation models, Kandinsky 5.0 Video Lite - a line-up of fast and lightweight 2B parameter text-to-video and image-to-video models, and Kandinsky 5.0 Video Pro - 19B parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle - including collection, processing, filtering and clustering - for the multi-stage training pipeline that involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity</title>
      <itunes:episode>1390</itunes:episode>
      <podcast:episode>1390</podcast:episode>
      <itunes:title>What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d451ba73-7bcc-4c0e-a915-3a2aa2ab5c1c</guid>
      <link>https://share.transistor.fm/s/7b6ca2d4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15593v1">http://arxiv.org/abs/2511.15593v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15593v1">http://arxiv.org/abs/2511.15593v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Nov 2025 19:08:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7b6ca2d4/4b856550.mp3" length="21819669" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1360</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15593v1">http://arxiv.org/abs/2511.15593v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisPlay: Self-Evolving Vision-Language Models from Images</title>
      <itunes:episode>1389</itunes:episode>
      <podcast:episode>1389</podcast:episode>
      <itunes:title>VisPlay: Self-Evolving Vision-Language Models from Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a5b3571e-8ac5-435f-9411-c8325c36c4fa</guid>
      <link>https://share.transistor.fm/s/902c4097</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang</p>

            <p><strong>Title:</strong><br>
            VisPlay: Self-Evolving Vision-Language Models from Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15661v2">http://arxiv.org/abs/2511.15661v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang</p>

            <p><strong>Title:</strong><br>
            VisPlay: Self-Evolving Vision-Language Models from Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15661v2">http://arxiv.org/abs/2511.15661v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Nov 2025 19:07:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/902c4097/1c047d6d.mp3" length="21633227" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang</p>

            <p><strong>Title:</strong><br>
            VisPlay: Self-Evolving Vision-Language Models from Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15661v2">http://arxiv.org/abs/2511.15661v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset</title>
      <itunes:episode>1388</itunes:episode>
      <podcast:episode>1388</podcast:episode>
      <itunes:title>Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cfe65128-f2ad-4696-8547-e3b6926b7207</guid>
      <link>https://share.transistor.fm/s/60b88dfc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi</p>

            <p><strong>Title:</strong><br>
            Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15186v1">http://arxiv.org/abs/2511.15186v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi</p>

            <p><strong>Title:</strong><br>
            Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15186v1">http://arxiv.org/abs/2511.15186v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Nov 2025 19:07:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/60b88dfc/b531adeb.mp3" length="18644448" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1162</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Geon Choi, Hangyul Yoon, Hyunju Shin, Hyunki Park, Sang Hoon Seo, Eunho Yang, Edward Choi</p>

            <p><strong>Title:</strong><br>
            Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.15186v1">http://arxiv.org/abs/2511.15186v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by a small number of target labels and the reliance on long, detailed expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy in our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VIDEOP2R: Video Understanding from Perception to Reasoning</title>
      <itunes:episode>1387</itunes:episode>
      <podcast:episode>1387</podcast:episode>
      <itunes:title>VIDEOP2R: Video Understanding from Perception to Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">63ef6439-4fa5-4a22-8e64-24632a94cb09</guid>
      <link>https://share.transistor.fm/s/62f92c1a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan</p>

            <p><strong>Title:</strong><br>
            VIDEOP2R: Video Understanding from Perception to Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11113v1">http://arxiv.org/abs/2511.11113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan</p>

            <p><strong>Title:</strong><br>
            VIDEOP2R: Video Understanding from Perception to Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11113v1">http://arxiv.org/abs/2511.11113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:29:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/62f92c1a/b338cd1d.mp3" length="24184451" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1508</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan</p>

            <p><strong>Title:</strong><br>
            VIDEOP2R: Video Understanding from Perception to Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11113v1">http://arxiv.org/abs/2511.11113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO, and demonstrate that the model's perception output is information-sufficient for downstream reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models</title>
      <itunes:episode>1386</itunes:episode>
      <podcast:episode>1386</podcast:episode>
      <itunes:title>Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2dd05188-35ce-4f53-bc7d-6cef6d5bd292</guid>
      <link>https://share.transistor.fm/s/b91859a4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL, cs.AI, cs.LG, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08577v1">http://arxiv.org/abs/2511.08577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL, cs.AI, cs.LG, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08577v1">http://arxiv.org/abs/2511.08577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:28:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b91859a4/d74e1b5c.mp3" length="24019796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1498</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL, cs.AI, cs.LG, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08577v1">http://arxiv.org/abs/2511.08577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</title>
      <itunes:episode>1385</itunes:episode>
      <podcast:episode>1385</podcast:episode>
      <itunes:title>AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9bf98def-f4c5-42f8-976c-40be021973b3</guid>
      <link>https://share.transistor.fm/s/f8e92b27</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14295v1">http://arxiv.org/abs/2511.14295v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14295v1">http://arxiv.org/abs/2511.14295v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:28:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f8e92b27/ebf5008f.mp3" length="22910983" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14295v1">http://arxiv.org/abs/2511.14295v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space</title>
      <itunes:episode>1384</itunes:episode>
      <podcast:episode>1384</podcast:episode>
      <itunes:title>A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">83528d54-05e5-42f9-a1e6-84fae43c570d</guid>
      <link>https://share.transistor.fm/s/52d14a21</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang</p>

            <p><strong>Title:</strong><br>
            A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10555v4">http://arxiv.org/abs/2511.10555v4</a></p>

            <p><strong>Abstract:</strong><br>
            Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task of code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang</p>

            <p><strong>Title:</strong><br>
            A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10555v4">http://arxiv.org/abs/2511.10555v4</a></p>

            <p><strong>Abstract:</strong><br>
            Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task of code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating a style is worth one code.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:28:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52d14a21/9b2e8ab6.mp3" length="22899680" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huijie Liu, Shuhao Cui, Haoxiang Cao, Shuai Ma, Kai Wu, Guoliang Kang</p>

            <p><strong>Title:</strong><br>
            A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10555v4">http://arxiv.org/abs/2511.10555v4</a></p>

            <p><strong>Abstract:</strong><br>
            Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark</title>
      <itunes:episode>1383</itunes:episode>
      <podcast:episode>1383</podcast:episode>
      <itunes:title>Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6f68000-1187-4ab0-a68c-6feef970f16c</guid>
      <link>https://share.transistor.fm/s/a165ff3c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang</p>

            <p><strong>Title:</strong><br>
            Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13853v1">http://arxiv.org/abs/2511.13853v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang</p>

            <p><strong>Title:</strong><br>
            Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13853v1">http://arxiv.org/abs/2511.13853v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:27:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a165ff3c/9dde7537.mp3" length="21805030" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang</p>

            <p><strong>Title:</strong><br>
            Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13853v1">http://arxiv.org/abs/2511.13853v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs</title>
      <itunes:episode>1382</itunes:episode>
      <podcast:episode>1382</podcast:episode>
      <itunes:title>MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b1267a28-f34f-4f5e-b93c-a470b6ca6b68</guid>
      <link>https://share.transistor.fm/s/211ea37a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14159v1">http://arxiv.org/abs/2511.14159v1</a></p>

            <p><strong>Abstract:</strong><br>
            Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14159v1">http://arxiv.org/abs/2511.14159v1</a></p>

            <p><strong>Abstract:</strong><br>
            Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:27:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/211ea37a/932c245d.mp3" length="23529132" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1467</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng</p>

            <p><strong>Title:</strong><br>
            MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.14159v1">http://arxiv.org/abs/2511.14159v1</a></p>

            <p><strong>Abstract:</strong><br>
            Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding</title>
      <itunes:episode>1381</itunes:episode>
      <podcast:episode>1381</podcast:episode>
      <itunes:title>REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c766ef9-fd53-4acb-b347-67189f8a007b</guid>
      <link>https://share.transistor.fm/s/554d3b2c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan</p>

            <p><strong>Title:</strong><br>
            REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13026v1">http://arxiv.org/abs/2511.13026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the textual information is insufficient and a further rethinking process specifically targeting visual information is necessary; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan</p>

            <p><strong>Title:</strong><br>
            REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13026v1">http://arxiv.org/abs/2511.13026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the textual information is insufficient and a further rethinking process specifically targeting visual information is necessary; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Nov 2025 19:26:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/554d3b2c/672f80dc.mp3" length="25765226" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1607</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan</p>

            <p><strong>Title:</strong><br>
            REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13026v1">http://arxiv.org/abs/2511.13026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the textual information is insufficient and a further rethinking process specifically targeting visual information is necessary; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data</title>
      <itunes:episode>1380</itunes:episode>
      <podcast:episode>1380</podcast:episode>
      <itunes:title>Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">59020a8b-9318-4d67-9d84-8468b68a8ec0</guid>
      <link>https://share.transistor.fm/s/669e4e2b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.12609v1">http://arxiv.org/abs/2511.12609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.12609v1">http://arxiv.org/abs/2511.12609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:54:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/669e4e2b/9614be62.mp3" length="23476471" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.12609v1">http://arxiv.org/abs/2511.12609v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodality understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>P1: Mastering Physics Olympiads with Reinforcement Learning</title>
      <itunes:episode>1379</itunes:episode>
      <podcast:episode>1379</podcast:episode>
      <itunes:title>P1: Mastering Physics Olympiads with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">510736fb-27ec-443c-8f75-ac5cefdd04ed</guid>
      <link>https://share.transistor.fm/s/5356aaf9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui</p>

            <p><strong>Title:</strong><br>
            P1: Mastering Physics Olympiads with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13612v1">http://arxiv.org/abs/2511.13612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities, which excel especially at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025 and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also show strong performance on other reasoning tasks such as math and coding, demonstrating the broad generalizability of the P1 series.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui</p>

            <p><strong>Title:</strong><br>
            P1: Mastering Physics Olympiads with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13612v1">http://arxiv.org/abs/2511.13612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities, which excel especially at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025 and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also show strong performance on other reasoning tasks such as math and coding, demonstrating the broad generalizability of the P1 series.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:47:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5356aaf9/beed7ebc.mp3" length="21429265" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1336</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui</p>

            <p><strong>Title:</strong><br>
            P1: Mastering Physics Olympiads with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13612v1">http://arxiv.org/abs/2511.13612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities, which excel especially at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, earning a silver medal. Further equipped with the agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025 and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also show strong performance on other reasoning tasks such as math and coding, demonstrating the broad generalizability of the P1 series.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling</title>
      <itunes:episode>1378</itunes:episode>
      <podcast:episode>1378</podcast:episode>
      <itunes:title>MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">75438d8b-b592-4227-9308-881aaef94965</guid>
      <link>https://share.transistor.fm/s/cc7dfbc7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11793v2">http://arxiv.org/abs/2511.11793v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11793v2">http://arxiv.org/abs/2511.11793v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:47:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cc7dfbc7/c5759d9b.mp3" length="26689345" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1664</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Ryan Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Hellen Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu</p>

            <p><strong>Title:</strong><br>
            MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11793v2">http://arxiv.org/abs/2511.11793v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance</title>
      <itunes:episode>1377</itunes:episode>
      <podcast:episode>1377</podcast:episode>
      <itunes:title>Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">00802f65-08e8-40fe-8f30-2cd5f50e7985</guid>
      <link>https://share.transistor.fm/s/99d12528</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13254v1">http://arxiv.org/abs/2511.13254v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13254v1">http://arxiv.org/abs/2511.13254v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:46:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/99d12528/923cd4b1.mp3" length="23044277" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1437</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shalini Maiti, Amar Budhiraja, Bhavul Gauri, Gaurav Chaurasia, Anton Protopopov, Alexis Audran-Reiss, Michael Slater, Despoina Magka, Tatiana Shavrina, Roberta Raileanu, Yoram Bachrach</p>

            <p><strong>Title:</strong><br>
            Souper-Model: How Simple Arithmetic Unlocks State-of-the-Art LLM Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13254v1">http://arxiv.org/abs/2511.13254v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their training remains resource- and time-intensive, requiring massive compute power and careful orchestration of training procedures. Model souping, the practice of averaging weights from multiple models of the same architecture, has emerged as a promising pre- and post-training technique that can enhance performance without expensive retraining. In this paper, we introduce Soup Of Category Experts (SoCE), a principled approach for model souping that utilizes benchmark composition to identify optimal model candidates and applies non-uniform weighted averaging to maximize performance. Contrary to previous uniform-averaging approaches, our method leverages the observation that benchmark categories often exhibit low inter-correlations in model performance. SoCE identifies "expert" models for each weakly-correlated category cluster and combines them using optimized weighted averaging rather than uniform weights. We demonstrate that the proposed method improves performance and robustness across multiple domains, including multilingual capabilities, tool calling, and math, and achieves state-of-the-art results on the Berkeley Function Calling Leaderboard.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Part-X-MLLM: Part-aware 3D Multimodal Large Language Model</title>
      <itunes:episode>1376</itunes:episode>
      <podcast:episode>1376</podcast:episode>
      <itunes:title>Part-X-MLLM: Part-aware 3D Multimodal Large Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5f0dc87-3e90-4c5a-a51d-83458279a501</guid>
      <link>https://share.transistor.fm/s/b57320d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Part-X-MLLM: Part-aware 3D Multimodal Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13647v1">http://arxiv.org/abs/2511.13647v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&amp;A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Part-X-MLLM: Part-aware 3D Multimodal Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13647v1">http://arxiv.org/abs/2511.13647v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&amp;A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:46:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b57320d6/31141ae5.mp3" length="24968543" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1557</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Part-X-MLLM: Part-aware 3D Multimodal Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13647v1">http://arxiv.org/abs/2511.13647v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&amp;A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation</title>
      <itunes:episode>1375</itunes:episode>
      <podcast:episode>1375</podcast:episode>
      <itunes:title>MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">462087f8-ae22-4e53-abcf-425c1048a93b</guid>
      <link>https://share.transistor.fm/s/94fe7f32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li</p>

            <p><strong>Title:</strong><br>
            MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09611v3">http://arxiv.org/abs/2511.09611v3</a></p>

            <p><strong>Abstract:</strong><br>
            While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li</p>

            <p><strong>Title:</strong><br>
            MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09611v3">http://arxiv.org/abs/2511.09611v3</a></p>

            <p><strong>Abstract:</strong><br>
            While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:45:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94fe7f32/e04abc96.mp3" length="19953910" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li</p>

            <p><strong>Title:</strong><br>
            MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09611v3">http://arxiv.org/abs/2511.09611v3</a></p>

            <p><strong>Abstract:</strong><br>
            While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning</title>
      <itunes:episode>1374</itunes:episode>
      <podcast:episode>1374</podcast:episode>
      <itunes:title>GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cd70eb77-a1f7-4dea-bb7d-e2705ee7b649</guid>
      <link>https://share.transistor.fm/s/ae46d2b4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.IR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Duolin Sun, Meixiu Long, Dan Yang, Yihan Jiao, Zhehao Tan, Jie Feng, Junjie Wang, Yue Shen, Peng Wei, Jian Wang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11653v1">http://arxiv.org/abs/2511.11653v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have shown strong potential as rerankers to enhance the overall performance of RAG systems. However, existing reranking paradigms are constrained by a core theoretical and practical dilemma: Pointwise methods, while simple and highly flexible, evaluate documents independently, making them prone to the Ranking Myopia Trap, overlooking the relative importance between documents. In contrast, Listwise methods can perceive the global ranking context, but suffer from inherent List Rigidity, leading to severe scalability and flexibility issues when handling large candidate sets. To address these challenges, we propose Groupwise, a novel reranking paradigm. In this approach, the query and a group of candidate documents are jointly fed into the model, which performs within-group comparisons to assign individual relevance scores to each document. This design retains the flexibility of Pointwise methods while enabling the comparative capability of Listwise methods. We further adopt GRPO for model training, equipped with a heterogeneous reward function that integrates ranking metrics with a distributional reward aimed at aligning score distributions across groups. To overcome the bottleneck caused by the scarcity of high-quality labeled data, we further propose an innovative pipeline for synthesizing high-quality retrieval and ranking data. The resulting data can be leveraged not only for training the reranker but also for training the retriever. Extensive experiments on two reasoning-intensive retrieval benchmarks, BRIGHT and R2MED, validate the effectiveness of our approach.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.IR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Duolin Sun, Meixiu Long, Dan Yang, Yihan Jiao, Zhehao Tan, Jie Feng, Junjie Wang, Yue Shen, Peng Wei, Jian Wang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11653v1">http://arxiv.org/abs/2511.11653v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have shown strong potential as rerankers to enhance the overall performance of RAG systems. However, existing reranking paradigms are constrained by a core theoretical and practical dilemma: Pointwise methods, while simple and highly flexible, evaluate documents independently, making them prone to the Ranking Myopia Trap, overlooking the relative importance between documents. In contrast, Listwise methods can perceive the global ranking context, but suffer from inherent List Rigidity, leading to severe scalability and flexibility issues when handling large candidate sets. To address these challenges, we propose Groupwise, a novel reranking paradigm. In this approach, the query and a group of candidate documents are jointly fed into the model, which performs within-group comparisons to assign individual relevance scores to each document. This design retains the flexibility of Pointwise methods while enabling the comparative capability of Listwise methods. We further adopt GRPO for model training, equipped with a heterogeneous reward function that integrates ranking metrics with a distributional reward aimed at aligning score distributions across groups. To overcome the bottleneck caused by the scarcity of high-quality labeled data, we further propose an innovative pipeline for synthesizing high-quality retrieval and ranking data. The resulting data can be leveraged not only for training the reranker but also for training the retriever. Extensive experiments on two reasoning-intensive retrieval benchmarks, BRIGHT and R2MED, validate the effectiveness of our approach.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:45:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ae46d2b4/cf22ecca.mp3" length="22917215" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1429</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.IR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Duolin Sun, Meixiu Long, Dan Yang, Yihan Jiao, Zhehao Tan, Jie Feng, Junjie Wang, Yue Shen, Peng Wei, Jian Wang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11653v1">http://arxiv.org/abs/2511.11653v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have shown strong potential as rerankers to enhance the overall performance of RAG systems. However, existing reranking paradigms are constrained by a core theoretical and practical dilemma: Pointwise methods, while simple and highly flexible, evaluate documents independently, making them prone to the Ranking Myopia Trap, overlooking the relative importance between documents. In contrast, Listwise methods can perceive the global ranking context, but suffer from inherent List Rigidity, leading to severe scalability and flexibility issues when handling large candidate sets. To address these challenges, we propose Groupwise, a novel reranking paradigm. In this approach, the query and a group of candidate documents are jointly fed into the model, which performs within-group comparisons to assign individual relevance scores to each document. This design retains the flexibility of Pointwise methods while enabling the comparative capability of Listwise methods. We further adopt GRPO for model training, equipped with a heterogeneous reward function that integrates ranking metrics with a distributional reward aimed at aligning score distributions across groups. To overcome the bottleneck caused by the scarcity of high-quality labeled data, we further propose an innovative pipeline for synthesizing high-quality retrieval and ranking data. The resulting data can be leveraged not only for training the reranker but also for training the retriever. Extensive experiments on two reasoning-intensive retrieval benchmarks, BRIGHT and R2MED, validate the effectiveness of our approach.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models</title>
      <itunes:episode>1373</itunes:episode>
      <podcast:episode>1373</podcast:episode>
      <itunes:title>TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a9610bf8-4431-41ca-90cc-6666c8619fe3</guid>
      <link>https://share.transistor.fm/s/78053750</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13704v1">http://arxiv.org/abs/2511.13704v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning &amp; Search, ii) Spatial &amp; Visual Pattern Reasoning, iii) Symbolic &amp; Logical Reasoning, and iv) Action Planning &amp; Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13704v1">http://arxiv.org/abs/2511.13704v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning &amp; Search, ii) Spatial &amp; Visual Pattern Reasoning, iii) Symbolic &amp; Logical Reasoning, and iv) Action Planning &amp; Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:45:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78053750/87cd13a7.mp3" length="22309086" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13704v1">http://arxiv.org/abs/2511.13704v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning &amp; Search, ii) Spatial &amp; Visual Pattern Reasoning, iii) Symbolic &amp; Logical Reasoning, and iv) Action Planning &amp; Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image</title>
      <itunes:episode>1372</itunes:episode>
      <podcast:episode>1372</podcast:episode>
      <itunes:title>PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">17bfcb27-6abe-4910-9c60-a7b4d6ec8697</guid>
      <link>https://share.transistor.fm/s/cdc795d4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13648v1">http://arxiv.org/abs/2511.13648v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13648v1">http://arxiv.org/abs/2511.13648v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Nov 2025 19:44:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cdc795d4/6ce031d9.mp3" length="24460315" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1525</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.13648v1">http://arxiv.org/abs/2511.13648v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models</title>
      <itunes:episode>1371</itunes:episode>
      <podcast:episode>1371</podcast:episode>
      <itunes:title>GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ec2efdfb-7844-4ef7-9ac2-805399712c8f</guid>
      <link>https://share.transistor.fm/s/fa236daf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11134v1">http://arxiv.org/abs/2511.11134v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model's ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11134v1">http://arxiv.org/abs/2511.11134v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model's ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:42:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa236daf/66ba5556.mp3" length="21141313" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, Cheng Tan</p>

            <p><strong>Title:</strong><br>
            GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11134v1">http://arxiv.org/abs/2511.11134v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model's ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: https://opendatalab-raiser.github.io/GGBench/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DoPE: Denoising Rotary Position Embedding</title>
      <itunes:episode>1370</itunes:episode>
      <podcast:episode>1370</podcast:episode>
      <itunes:title>DoPE: Denoising Rotary Position Embedding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">53d0bd62-4d8c-49b2-b6ed-b7863d38fa9c</guid>
      <link>https://share.transistor.fm/s/57c50aa9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            DoPE: Denoising Rotary Position Embedding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09146v1">http://arxiv.org/abs/2511.09146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is https://The-physical-picture-of-LLMs.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            DoPE: Denoising Rotary Position Embedding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09146v1">http://arxiv.org/abs/2511.09146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is https://The-physical-picture-of-LLMs.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:36:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/57c50aa9/cb3886fd.mp3" length="18679494" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1164</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            DoPE: Denoising Rotary Position Embedding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09146v1">http://arxiv.org/abs/2511.09146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is https://The-physical-picture-of-LLMs.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation</title>
      <itunes:episode>1369</itunes:episode>
      <podcast:episode>1369</podcast:episode>
      <itunes:title>WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">012ad9eb-f967-44a6-853f-25f68c0ff5e1</guid>
      <link>https://share.transistor.fm/s/7f7af660</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua</p>

            <p><strong>Title:</strong><br>
            WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11434v1">http://arxiv.org/abs/2511.11434v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua</p>

            <p><strong>Title:</strong><br>
            WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11434v1">http://arxiv.org/abs/2511.11434v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:36:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f7af660/0e42cb68.mp3" length="24455739" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1525</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua</p>

            <p><strong>Title:</strong><br>
            WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11434v1">http://arxiv.org/abs/2511.11434v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation</title>
      <itunes:episode>1368</itunes:episode>
      <podcast:episode>1368</podcast:episode>
      <itunes:title>UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23ecfeda-0177-4d56-b78e-259f03a56d0c</guid>
      <link>https://share.transistor.fm/s/9da05faf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08195v2">http://arxiv.org/abs/2511.08195v2</a></p>

            <p><strong>Abstract:</strong><br>
            User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.</p>
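
            <p><strong>Code sketch:</strong><br>
            A self-contained, hypothetical sketch of the interactive loop described above: generate code, render it, score it against the target UI, and feed the critique back as the next turn's input. All helpers are toy stand-ins, not the UI2Code^N API.</p>

            <pre><code># Toy stand-ins so the control flow runs end to end; not the UI2Code^N implementation.
def generate_code(target, feedback=None):
    # Stand-in for a VLM call that maps a screenshot (plus optional critique) to code.
    base = "div { color: red; }"
    return base + "  /* refined */" if feedback else base

def render(code):
    # Stand-in for headless-browser rendering; here the "rendering" is the code itself.
    return code

def visual_diff(rendered, target):
    # Stand-in for visual comparison; returns a score in [0, 1] and a textual critique.
    score = 1.0 if "refined" in rendered else 0.5
    return score, "spacing and colors still differ from the target"

def interactive_ui_to_code(target, max_turns=4, threshold=0.9):
    feedback, best_code, best_score = None, None, -1.0
    for _ in range(max_turns):          # test-time scaling: more turns, more refinement
        code = generate_code(target, feedback)
        score, critique = visual_diff(render(code), target)
        if score > best_score:
            best_code, best_score = code, score
        if score >= threshold:
            break
        feedback = critique             # iterative visual feedback drives the next turn
    return best_code

print(interactive_ui_to_code(target="screenshot.png"))</code></pre>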
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08195v2">http://arxiv.org/abs/2511.08195v2</a></p>

            <p><strong>Abstract:</strong><br>
            User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:35:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9da05faf/bc984564.mp3" length="24794705" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1546</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang</p>

            <p><strong>Title:</strong><br>
            UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08195v2">http://arxiv.org/abs/2511.08195v2</a></p>

            <p><strong>Abstract:</strong><br>
            User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code$^\text{N}$, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code$^\text{N}$ establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at https://github.com/zai-org/UI2Code_N.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery</title>
      <itunes:episode>1367</itunes:episode>
      <podcast:episode>1367</podcast:episode>
      <itunes:title>AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7e8984c-35c3-40e3-b84f-2f1d4c8b6d20</guid>
      <link>https://share.transistor.fm/s/dea82cc9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun</p>

            <p><strong>Title:</strong><br>
            AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11257v1">http://arxiv.org/abs/2511.11257v1</a></p>

            <p><strong>Abstract:</strong><br>
            The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.</p>
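
            <p><strong>Code sketch:</strong><br>
            A loose, hypothetical sketch of a hierarchical screen over cation/anion candidates: a cheap surrogate filters the full combinatorial space, and a more expensive predictor re-ranks the survivors. The species lists and both predictors are invented placeholders, not AIonopedia's models.</p>

            <pre><code># Invented placeholders throughout; not AIonopedia's models or data.
from itertools import product

CATIONS = ["[EMIM]+", "[BMIM]+", "[P4444]+"]
ANIONS = ["[BF4]-", "[NTf2]-", "[Cl]-"]

def coarse_predict(cation, anion):
    # Stand-in for a fast surrogate property predictor; lower is better here.
    return (len(cation) + len(anion)) * 0.1

def fine_predict(cation, anion):
    # Stand-in for the more expensive multimodal foundation-model prediction.
    return coarse_predict(cation, anion) + 0.05 * (anion == "[Cl]-")

def hierarchical_screen(keep_top=4, final_top=2):
    candidates = list(product(CATIONS, ANIONS))
    # Level 1: cheap filter over the full combinatorial space.
    shortlist = sorted(candidates, key=lambda pair: coarse_predict(*pair))[:keep_top]
    # Level 2: expensive re-ranking only on the shortlist.
    ranked = sorted(shortlist, key=lambda pair: fine_predict(*pair))
    return ranked[:final_top]

print(hierarchical_screen())</code></pre>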
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun</p>

            <p><strong>Title:</strong><br>
            AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11257v1">http://arxiv.org/abs/2511.11257v1</a></p>

            <p><strong>Abstract:</strong><br>
            The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:35:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dea82cc9/4cf58b01.mp3" length="27766806" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1732</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuqi Yin, Yibo Fu, Siyuan Wang, Peng Sun, Hongyu Wang, Xiaohui Wang, Lei Zheng, Zhiyong Li, Zhirong Liu, Jianji Wang, Zhaoxi Sun</p>

            <p><strong>Title:</strong><br>
            AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11257v1">http://arxiv.org/abs/2511.11257v1</a></p>

            <p><strong>Abstract:</strong><br>
            The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LiteAttention: A Temporal Sparse Attention for Diffusion Transformers</title>
      <itunes:episode>1366</itunes:episode>
      <podcast:episode>1366</podcast:episode>
      <itunes:title>LiteAttention: A Temporal Sparse Attention for Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05602ff1-dfe8-42fe-bebc-3211d2a62f32</guid>
      <link>https://share.transistor.fm/s/cda34136</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb</p>

            <p><strong>Title:</strong><br>
            LiteAttention: A Temporal Sparse Attention for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11062v1">http://arxiv.org/abs/2511.11062v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+δ$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.</p>
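
            <p><strong>Code sketch:</strong><br>
            A rough, hypothetical sketch of the temporal skip-propagation idea: a tile-level skip mask only grows across denoising steps, so the sparsity pattern is carried forward rather than re-estimated from scratch at each step. Tile counts, scores, and the threshold are illustrative stand-ins, not the released kernel.</p>

            <pre><code># Illustrative stand-in for the skip-propagation logic; not the LiteAttention kernel.
import numpy as np

def mark_non_essential(tile_scores, threshold=0.1):
    # Tiles whose attention mass falls below the threshold are treated as skippable.
    return np.less(tile_scores, threshold)

def denoise_with_skip_propagation(num_steps=10, num_tiles=64):
    rng = np.random.default_rng(0)
    skip_mask = np.zeros(num_tiles, dtype=bool)       # nothing skipped at the start
    computed = 0
    for step in range(num_steps):
        active = ~skip_mask                           # only these tiles get attention
        tile_scores = rng.random(num_tiles) * active  # stand-in for per-tile mass
        computed += int(active.sum())
        # Propagate forward: the skip set only grows across denoising steps.
        skip_mask |= mark_non_essential(tile_scores) & active
    print("computed", computed, "of", num_steps * num_tiles, "attention tiles")

denoise_with_skip_propagation()</code></pre>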
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb</p>

            <p><strong>Title:</strong><br>
            LiteAttention: A Temporal Sparse Attention for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11062v1">http://arxiv.org/abs/2511.11062v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+δ$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:34:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cda34136/889e4fee.mp3" length="20476328" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb</p>

            <p><strong>Title:</strong><br>
            LiteAttention: A Temporal Sparse Attention for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11062v1">http://arxiv.org/abs/2511.11062v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+δ$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Virtual Width Networks</title>
      <itunes:episode>1365</itunes:episode>
      <podcast:episode>1365</podcast:episode>
      <itunes:title>Virtual Width Networks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa442dff-d51a-458e-a0a9-91bb874f7171</guid>
      <link>https://share.transistor.fm/s/42f0a631</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang</p>

            <p><strong>Title:</strong><br>
            Virtual Width Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11238v1">http://arxiv.org/abs/2511.11238v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.</p>
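
            <p><strong>Code sketch:</strong><br>
            One generic, hypothetical reading of the idea in code (not the paper's exact architecture): tokens are embedded in a wide virtual space, projected down to the narrow backbone width for the transformer blocks, and projected back up only at the output, so backbone compute stays tied to the narrow width.</p>

            <pre><code># Generic illustration of width decoupling; layer sizes are arbitrary, not the paper's.
import torch
import torch.nn as nn

class VirtualWidthLM(nn.Module):
    def __init__(self, vocab=1000, backbone_width=256, expansion=8, layers=2):
        super().__init__()
        virtual_width = backbone_width * expansion      # e.g. an 8x wider embedding space
        self.embed = nn.Embedding(vocab, virtual_width)
        self.down = nn.Linear(virtual_width, backbone_width)
        block = nn.TransformerEncoderLayer(d_model=backbone_width, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.up = nn.Linear(backbone_width, virtual_width)
        self.head = nn.Linear(virtual_width, vocab)

    def forward(self, tokens):
        x = self.down(self.embed(tokens))   # narrow path through the backbone
        x = self.backbone(x)
        return self.head(self.up(x))        # widen again only at the output

tokens = torch.randint(0, 1000, (2, 16))
print(VirtualWidthLM()(tokens).shape)        # torch.Size([2, 16, 1000])</code></pre>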
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang</p>

            <p><strong>Title:</strong><br>
            Virtual Width Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11238v1">http://arxiv.org/abs/2511.11238v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Nov 2025 19:34:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/42f0a631/8699dffb.mp3" length="21753983" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang, Huan Zhou, Huanzhang Dou, Jianhui Duan, Jianqiao Lu, Jianyu Jiang, Jiayi Xu, Jiecao Chen, Jin Chen, Jin Ma, Jing Su, Jingji Chen, Jun Wang, Jun Yuan, Juncai Liu, Jundong Zhou, Kai Hua, Kai Shen, Kai Xiang, Kaiyuan Chen, Kang Liu, Ke Shen, Liang Xiang, Lin Yan, Lishu Luo, Mengyao Zhang, Ming Ding, Mofan Zhang, Nianning Liang, Peng Li, Penghao Huang, Pengpeng Mu, Qi Huang, Qianli Ma, Qiyang Min, Qiying Yu, Renming Pang, Ru Zhang, Shen Yan, Shen Yan, Shixiong Zhao, Shuaishuai Cao, Shuang Wu, Siyan Chen, Siyu Li, Siyuan Qiao, Tao Sun, Tian Xin, Tiantian Fan, Ting Huang, Ting-Han Fan, Wei Jia, Wenqiang Zhang, Wenxuan Liu, Xiangzhong Wu, Xiaochen Zuo, Xiaoying Jia, Ximing Yang, Xin Liu, Xin Yu, Xingyan Bin, Xintong Hao, Xiongcai Luo, Xujing Li, Xun Zhou, Yanghua Peng, Yangrui Chen, Yi Lin, Yichong Leng, Yinghao Li, Yingshuan Song, Yiyuan Ma, Yong Shan, Yongan Xiang, Yonghui Wu, Yongtao Zhang, Yongzhen Yao, Yu Bao, Yuehang Yang, Yufeng Yuan, Yunshui Li, Yuqiao Xian, Yutao Zeng, Yuxuan Wang, Zehua Hong, Zehua Wang, Zengzhi Wang, Zeyu Yang, Zhengqiang Yin, Zhenyi Lu, Zhexi Zhang, Zhi Chen, Zhi Zhang, Zhiqi Lin, Zihao Huang, Zilin Xu, Ziyun Wei, Zuo Wang</p>

            <p><strong>Title:</strong><br>
            Virtual Width Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.11238v1">http://arxiv.org/abs/2511.11238v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models</title>
      <itunes:episode>1364</itunes:episode>
      <podcast:episode>1364</podcast:episode>
      <itunes:title>One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">489bf329-ba40-460b-8a4b-a998ef60b334</guid>
      <link>https://share.transistor.fm/s/eba2b6f7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aleksandr Razin, Danil Kazantsev, Ilya Makarov</p>

            <p><strong>Title:</strong><br>
            One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10629v1">http://arxiv.org/abs/2511.10629v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.</p>
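
            <p><strong>Code sketch:</strong><br>
            A hypothetical sketch of upscaling in latent space before VAE decoding: a small convolutional stand-in for the Swin-style backbone feeds scale-specific pixel-shuffle heads for 2x and 4x. Channel counts and layer sizes are illustrative, not the released LUA.</p>

            <pre><code># Illustrative layer sizes; a stand-in for the Swin-style backbone, not the released LUA.
import torch
import torch.nn as nn

class LatentUpscaleAdapter(nn.Module):
    def __init__(self, latent_channels=4, width=64):
        super().__init__()
        self.backbone = nn.Sequential(                # stand-in for the Swin-style backbone
            nn.Conv2d(latent_channels, width, 3, padding=1), nn.GELU(),
            nn.Conv2d(width, width, 3, padding=1), nn.GELU(),
        )
        # One pixel-shuffle head per supported scale factor.
        self.heads = nn.ModuleDict({
            "2": nn.Sequential(nn.Conv2d(width, latent_channels * 4, 3, padding=1),
                               nn.PixelShuffle(2)),
            "4": nn.Sequential(nn.Conv2d(width, latent_channels * 16, 3, padding=1),
                               nn.PixelShuffle(4)),
        })

    def forward(self, latent, scale=2):
        return self.heads[str(scale)](self.backbone(latent))

latent = torch.randn(1, 4, 64, 64)                     # e.g. the latent behind a 512 px image
print(LatentUpscaleAdapter()(latent, scale=2).shape)   # torch.Size([1, 4, 128, 128])
# The upscaled latent would then pass through the unchanged VAE decoder.</code></pre>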
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aleksandr Razin, Danil Kazantsev, Ilya Makarov</p>

            <p><strong>Title:</strong><br>
            One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10629v1">http://arxiv.org/abs/2511.10629v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Nov 2025 19:04:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eba2b6f7/0c709873.mp3" length="21175611" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1320</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aleksandr Razin, Danil Kazantsev, Ilya Makarov</p>

            <p><strong>Title:</strong><br>
            One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.10629v1">http://arxiv.org/abs/2511.10629v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PAN: A World Model for General, Interactable, and Long-Horizon World Simulation</title>
      <itunes:episode>1363</itunes:episode>
      <podcast:episode>1363</podcast:episode>
      <itunes:title>PAN: A World Model for General, Interactable, and Long-Horizon World Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ebfff914-dc9a-4528-b2bd-65ec9d22428e</guid>
      <link>https://share.transistor.fm/s/cd78a5d3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing</p>

            <p><strong>Title:</strong><br>
            PAN: A World Model for General, Interactable, and Long-Horizon World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09057v2">http://arxiv.org/abs/2511.09057v2</a></p>

            <p><strong>Abstract:</strong><br>
            A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.</p>
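
            <p><strong>Code sketch:</strong><br>
            A toy, hypothetical sketch of the simulate-then-decode loop implied by the GLP description: a latent dynamics stand-in predicts the next world state from the history and a natural-language action, and a decoder stand-in turns that latent into frames. Nothing here reflects PAN's actual models.</p>

            <pre><code># Toy stand-ins only; not PAN's LLM backbone or video diffusion decoder.
import numpy as np

def latent_dynamics(history, action, rng):
    # Stand-in for the autoregressive backbone: next latent from history plus action.
    drift = 0.01 * len(action)
    return history[-1] + drift + 0.1 * rng.standard_normal(history[-1].shape)

def video_decoder(latent, rng, frames=4):
    # Stand-in for the diffusion decoder: one short clip per latent state.
    return np.stack([latent + 0.05 * rng.standard_normal(latent.shape)
                     for _ in range(frames)])

def rollout(actions, latent_dim=8, seed=0):
    rng = np.random.default_rng(seed)
    history = [rng.standard_normal(latent_dim)]        # initial world state
    clips = []
    for action in actions:                             # long-horizon, action-conditioned
        nxt = latent_dynamics(history, action, rng)
        history.append(nxt)                            # reasoning stays in latent space
        clips.append(video_decoder(nxt, rng))          # observations decoded on demand
    return clips

clips = rollout(["open the door", "walk through", "turn left"])
print(len(clips), clips[0].shape)                      # 3 clips of shape (4, 8)</code></pre>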
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing</p>

            <p><strong>Title:</strong><br>
            PAN: A World Model for General, Interactable, and Long-Horizon World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09057v2">http://arxiv.org/abs/2511.09057v2</a></p>

            <p><strong>Abstract:</strong><br>
            A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Nov 2025 19:03:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cd78a5d3/84b6f60b.mp3" length="24961876" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1556</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            PAN Team, Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, Ganesh Bannur, Junrong Chen, Kimi Chen, Mingkai Deng, Ruobing Han, Xinqi Huang, Haoqiang Kang, Zheqi Li, Enze Ma, Hector Ren, Yashowardhan Shinde, Rohan Shingre, Ramsundar Tanikella, Kaiming Tao, Dequan Yang, Xinle Yu, Cong Zeng, Binglin Zhou, Zhengzhong Liu, Zhiting Hu, Eric P. Xing</p>

            <p><strong>Title:</strong><br>
            PAN: A World Model for General, Interactable, and Long-Horizon World Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.09057v2">http://arxiv.org/abs/2511.09057v2</a></p>

            <p><strong>Abstract:</strong><br>
            A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist</title>
      <itunes:episode>1362</itunes:episode>
      <podcast:episode>1362</podcast:episode>
      <itunes:title>UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa341bb0-5337-4250-a0b3-0cee9f439959</guid>
      <link>https://share.transistor.fm/s/3e1dd8f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei</p>

            <p><strong>Title:</strong><br>
            UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08521v1">http://arxiv.org/abs/2511.08521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)</p>
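
            <p><strong>Code sketch:</strong><br>
            A small, hypothetical sketch of the Plan-and-Act pattern: a planner decomposes the request into ordered steps, each step is dispatched to a registered tool, and results accumulate in a shared task memory that later steps can read. The plan and tool names are invented for illustration, not UniVA's MCP servers.</p>

            <pre><code># Invented plan and tool names; not the UniVA codebase or its MCP tool servers.
def plan(request):
    # Stand-in for the planner agent; a real planner would be an LLM call.
    return [("segment", "find the main object"),
            ("edit", "recolor the object to blue"),
            ("generate", "compose a closing shot")]

TOOLS = {
    "segment":  lambda instruction, memory: "mask for: " + instruction,
    "edit":     lambda instruction, memory: "edited clip using " + memory[-1],
    "generate": lambda instruction, memory: "new clip: " + instruction,
}

def run(request):
    memory = []                                         # task context shared across steps
    for tool_name, instruction in plan(request):
        result = TOOLS[tool_name](instruction, memory)  # executor dispatches to a tool
        memory.append(result)                           # traceable step-by-step history
    return memory

for step_result in run("make the object blue and add an ending shot"):
    print(step_result)</code></pre>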
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei</p>

            <p><strong>Title:</strong><br>
            UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08521v1">http://arxiv.org/abs/2511.08521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Nov 2025 19:03:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e1dd8f1/55e71f78.mp3" length="26541346" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1655</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei</p>

            <p><strong>Title:</strong><br>
            UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.08521v1">http://arxiv.org/abs/2511.08521v1</a></p>

            <p><strong>Abstract:</strong><br>
            While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Too Good to be Bad: On the Failure of LLMs to Role-Play Villains</title>
      <itunes:episode>1361</itunes:episode>
      <podcast:episode>1361</podcast:episode>
      <itunes:title>Too Good to be Bad: On the Failure of LLMs to Role-Play Villains</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c57c7dd-3e0c-4976-a121-1de9bf28ebe0</guid>
      <link>https://share.transistor.fm/s/de67fe09</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus</p>

            <p><strong>Title:</strong><br>
            Too Good to be Bad: On the Failure of LLMs to Role-Play Villains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04962v1">http://arxiv.org/abs/2511.04962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as ``Deceitful'' and ``Manipulative'', often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus</p>

            <p><strong>Title:</strong><br>
            Too Good to be Bad: On the Failure of LLMs to Role-Play Villains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04962v1">http://arxiv.org/abs/2511.04962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as ``Deceitful'' and ``Manipulative'', often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Nov 2025 19:24:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de67fe09/246e1c39.mp3" length="24418096" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1522</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihao Yi, Qingxuan Jiang, Ruotian Ma, Xingyu Chen, Qu Yang, Mengru Wang, Fanghua Ye, Ying Shen, Zhaopeng Tu, Xiaolong Li, Linus</p>

            <p><strong>Title:</strong><br>
            Too Good to be Bad: On the Failure of LLMs to Role-Play Villains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04962v1">http://arxiv.org/abs/2511.04962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepEyesV2: Toward Agentic Multimodal Model</title>
      <itunes:episode>1360</itunes:episode>
      <podcast:episode>1360</podcast:episode>
      <itunes:title>DeepEyesV2: Toward Agentic Multimodal Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e5c60fdf-1cc2-44ea-a619-bcc1c8f781d2</guid>
      <link>https://share.transistor.fm/s/a399fe74</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu</p>

            <p><strong>Title:</strong><br>
            DeepEyesV2: Toward Agentic Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05271v1">http://arxiv.org/abs/2511.05271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu</p>

            <p><strong>Title:</strong><br>
            DeepEyesV2: Toward Agentic Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05271v1">http://arxiv.org/abs/2511.05271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Nov 2025 19:23:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a399fe74/f484882f.mp3" length="25352214" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1581</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu</p>

            <p><strong>Title:</strong><br>
            DeepEyesV2: Toward Agentic Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05271v1">http://arxiv.org/abs/2511.05271v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Spatial Tuning</title>
      <itunes:episode>1359</itunes:episode>
      <podcast:episode>1359</podcast:episode>
      <itunes:title>Visual Spatial Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">db61a297-7621-4292-a65f-4d85520a3783</guid>
      <link>https://share.transistor.fm/s/b6d5d68b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Visual Spatial Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05491v1">http://arxiv.org/abs/2511.05491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Visual Spatial Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05491v1">http://arxiv.org/abs/2511.05491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Nov 2025 19:23:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b6d5d68b/57dcceb3.mp3" length="23324254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1454</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Visual Spatial Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.05491v1">http://arxiv.org/abs/2511.05491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks</title>
      <itunes:episode>1358</itunes:episode>
      <podcast:episode>1358</podcast:episode>
      <itunes:title>VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e0d808ac-5e35-402f-a2c1-c596e24053b5</guid>
      <link>https://share.transistor.fm/s/c82b617e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala</p>

            <p><strong>Title:</strong><br>
            VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04662v1">http://arxiv.org/abs/2511.04662v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala</p>

            <p><strong>Title:</strong><br>
            VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04662v1">http://arxiv.org/abs/2511.04662v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Nov 2025 19:23:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c82b617e/04b5d7ba.mp3" length="22062079" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1375</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Feng, Nathaniel Weir, Kaj Bostrom, Sam Bayless, Darion Cassel, Sapana Chaudhary, Benjamin Kiesl-Reiter, Huzefa Rangwala</p>

            <p><strong>Title:</strong><br>
            VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04662v1">http://arxiv.org/abs/2511.04662v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm</title>
      <itunes:episode>1357</itunes:episode>
      <podcast:episode>1357</podcast:episode>
      <itunes:title>Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">46b3e4a4-4846-48e9-93e8-e57af9b0ae4c</guid>
      <link>https://share.transistor.fm/s/224fb52b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04570v1">http://arxiv.org/abs/2511.04570v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are a promising route to unified multimodal understanding and generation, positioning "thinking with video" as a unified multimodal reasoning paradigm.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04570v1">http://arxiv.org/abs/2511.04570v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are a promising route to unified multimodal understanding and generation, positioning "thinking with video" as a unified multimodal reasoning paradigm.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Nov 2025 19:00:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/224fb52b/ff3f7ec8.mp3" length="23454719" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1462</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04570v1">http://arxiv.org/abs/2511.04570v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are a promising route to unified multimodal understanding and generation, positioning "thinking with video" as a unified multimodal reasoning paradigm.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>V-Thinker: Interactive Thinking with Images</title>
      <itunes:episode>1356</itunes:episode>
      <podcast:episode>1356</podcast:episode>
      <itunes:title>V-Thinker: Interactive Thinking with Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12992d95-cab8-48cd-9a35-5116113d166e</guid>
      <link>https://share.transistor.fm/s/cec76665</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            V-Thinker: Interactive Thinking with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04460v1">http://arxiv.org/abs/2511.04460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions (diversity, quality, and difficulty); and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            V-Thinker: Interactive Thinking with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04460v1">http://arxiv.org/abs/2511.04460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions (diversity, quality, and difficulty); and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Nov 2025 19:00:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cec76665/2292ac33.mp3" length="20231378" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            V-Thinker: Interactive Thinking with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.04460v1">http://arxiv.org/abs/2511.04460v1</a></p>

            <p><strong>Abstract:</strong><br>
            Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions (diversity, quality, and difficulty); and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Agent Learning via Experience Synthesis</title>
      <itunes:episode>1355</itunes:episode>
      <podcast:episode>1355</podcast:episode>
      <itunes:title>Scaling Agent Learning via Experience Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b13706de-5b51-44e7-95b4-7ad8f54bab14</guid>
      <link>https://share.transistor.fm/s/92fa074b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh</p>

            <p><strong>Title:</strong><br>
            Scaling Agent Learning via Experience Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03773v1">http://arxiv.org/abs/2511.03773v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh</p>

            <p><strong>Title:</strong><br>
            Scaling Agent Learning via Experience Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03773v1">http://arxiv.org/abs/2511.03773v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Nov 2025 19:00:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92fa074b/ffd4dee2.mp3" length="22141038" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh</p>

            <p><strong>Title:</strong><br>
            Scaling Agent Learning via Experience Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03773v1">http://arxiv.org/abs/2511.03773v1</a></p>

            <p><strong>Abstract:</strong><br>
            While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion Language Models are Super Data Learners</title>
      <itunes:episode>1354</itunes:episode>
      <podcast:episode>1354</podcast:episode>
      <itunes:title>Diffusion Language Models are Super Data Learners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">67943b95-92e4-4aa1-886e-791ad6949f5d</guid>
      <link>https://share.transistor.fm/s/97736993</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            Diffusion Language Models are Super Data Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03276v1">http://arxiv.org/abs/2511.03276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves &gt; 56% accuracy on HellaSwag and &gt; 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            Diffusion Language Models are Super Data Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03276v1">http://arxiv.org/abs/2511.03276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves &gt; 56% accuracy on HellaSwag and &gt; 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Nov 2025 19:05:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/97736993/dd288f69.mp3" length="21612321" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinjie Ni, Qian Liu, Longxu Dou, Chao Du, Zili Wang, Hang Yan, Tianyu Pang, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            Diffusion Language Models are Super Data Learners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03276v1">http://arxiv.org/abs/2511.03276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Under strictly controlled pre-training settings, we observe a Crossover: when unique data is limited, diffusion language models (DLMs) consistently surpass autoregressive (AR) models by training for more epochs. The crossover shifts later with more or higher-quality data, earlier with larger models, and persists across dense and sparse architectures. We attribute the gains to three compounding factors: (1) any-order modeling, (2) super-dense compute from iterative bidirectional denoising, and (3) built-in Monte Carlo augmentation; input or parameter noise improves AR under data constraint but cannot close the gap. At scale, a 1.7B DLM trained with a ~1.5T-token compute budget on 10B unique Python tokens overtakes an AR coder trained with strictly matched settings. In addition, a 1B-parameter DLM achieves &gt; 56% accuracy on HellaSwag and &gt; 33% on MMLU using only 1B tokens, without any special tricks, just by repeating standard pre-training data. We also show that rising validation cross-entropy does not imply degraded downstream performance in this regime.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation</title>
      <itunes:episode>1353</itunes:episode>
      <podcast:episode>1353</podcast:episode>
      <itunes:title>LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d022225-b720-49ed-b201-5b99c25be3f9</guid>
      <link>https://share.transistor.fm/s/6dfec66b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03001v1">http://arxiv.org/abs/2511.03001v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03001v1">http://arxiv.org/abs/2511.03001v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Nov 2025 19:05:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6dfec66b/274d812b.mp3" length="24874968" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1551</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03001v1">http://arxiv.org/abs/2511.03001v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions</title>
      <itunes:episode>1352</itunes:episode>
      <podcast:episode>1352</podcast:episode>
      <itunes:title>UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef95f783-7497-44c5-b895-ba50fbfd30ea</guid>
      <link>https://share.transistor.fm/s/2721c1e3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang</p>

            <p><strong>Title:</strong><br>
            UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03334v1">http://arxiv.org/abs/2511.03334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang</p>

            <p><strong>Title:</strong><br>
            UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03334v1">http://arxiv.org/abs/2511.03334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Nov 2025 19:04:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2721c1e3/f53d9dd6.mp3" length="22442842" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang</p>

            <p><strong>Title:</strong><br>
            UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.03334v1">http://arxiv.org/abs/2511.03334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization</title>
      <itunes:episode>1351</itunes:episode>
      <podcast:episode>1351</podcast:episode>
      <itunes:title>Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e4554bea-94d0-49a3-8311-4a87268a6252</guid>
      <link>https://share.transistor.fm/s/872b2245</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.LG, cs.AI, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25616v1">http://arxiv.org/abs/2510.25616v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.LG, cs.AI, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25616v1">http://arxiv.org/abs/2510.25616v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Nov 2025 19:08:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/872b2245/2da72f0f.mp3" length="27255633" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1700</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.LG, cs.AI, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25616v1">http://arxiv.org/abs/2510.25616v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation</title>
      <itunes:episode>1350</itunes:episode>
      <podcast:episode>1350</podcast:episode>
      <itunes:title>VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bb60cd58-6530-4824-9f6e-1a3cf286994b</guid>
      <link>https://share.transistor.fm/s/8db9ae70</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang</p>

            <p><strong>Title:</strong><br>
            VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02778v1">http://arxiv.org/abs/2511.02778v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang</p>

            <p><strong>Title:</strong><br>
            VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02778v1">http://arxiv.org/abs/2511.02778v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Nov 2025 19:07:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8db9ae70/641c9117.mp3" length="21086977" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang</p>

            <p><strong>Title:</strong><br>
            VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02778v1">http://arxiv.org/abs/2511.02778v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</title>
      <itunes:episode>1349</itunes:episode>
      <podcast:episode>1349</podcast:episode>
      <itunes:title>When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b2b65f5-9a3f-432e-a8c9-bd9e207286be</guid>
      <link>https://share.transistor.fm/s/e74803aa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye</p>

            <p><strong>Title:</strong><br>
            When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02779v1">http://arxiv.org/abs/2511.02779v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye</p>

            <p><strong>Title:</strong><br>
            When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02779v1">http://arxiv.org/abs/2511.02779v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Nov 2025 19:07:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e74803aa/dc2a1132.mp3" length="23306773" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1453</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye</p>

            <p><strong>Title:</strong><br>
            When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.02779v1">http://arxiv.org/abs/2511.02779v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation</title>
      <itunes:episode>1348</itunes:episode>
      <podcast:episode>1348</podcast:episode>
      <itunes:title>Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e0744ce-81a5-449f-978d-2d71e7528ce3</guid>
      <link>https://share.transistor.fm/s/7099b34f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang</p>

            <p><strong>Title:</strong><br>
            Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.22115v1">http://arxiv.org/abs/2510.22115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang</p>

            <p><strong>Title:</strong><br>
            Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.22115v1">http://arxiv.org/abs/2510.22115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 20:00:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7099b34f/e8ee85fd.mp3" length="23233207" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1448</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang</p>

            <p><strong>Title:</strong><br>
            Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.22115v1">http://arxiv.org/abs/2510.22115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Ling 2.0, a series of reasoning-oriented language foundation models built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph</title>
      <itunes:episode>1347</itunes:episode>
      <podcast:episode>1347</podcast:episode>
      <itunes:title>Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">08ffa346-2269-4f0e-ac4c-c7079af2674d</guid>
      <link>https://share.transistor.fm/s/b9742896</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.AI, cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang</p>

            <p><strong>Title:</strong><br>
            Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00086v1">http://arxiv.org/abs/2511.00086v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.AI, cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang</p>

            <p><strong>Title:</strong><br>
            Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00086v1">http://arxiv.org/abs/2511.00086v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:59:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9742896/6e53079a.mp3" length="21985162" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1370</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.AI, cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang</p>

            <p><strong>Title:</strong><br>
            Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00086v1">http://arxiv.org/abs/2511.00086v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Underappreciated Power of Vision Models for Graph Structural Understanding</title>
      <itunes:episode>1346</itunes:episode>
      <podcast:episode>1346</podcast:episode>
      <itunes:title>The Underappreciated Power of Vision Models for Graph Structural Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a2f286ae-eb99-4232-aa15-5f8bd8597e48</guid>
      <link>https://share.transistor.fm/s/40b883ea</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu</p>

            <p><strong>Title:</strong><br>
            The Underappreciated Power of Vision Models for Graph Structural Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.24788v1">http://arxiv.org/abs/2510.24788v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu</p>

            <p><strong>Title:</strong><br>
            The Underappreciated Power of Vision Models for Graph Structural Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.24788v1">http://arxiv.org/abs/2510.24788v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:59:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40b883ea/8defcf61.mp3" length="24951426" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1556</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinjian Zhao, Wei Pang, Zhongkai Xue, Xiangru Jian, Lei Zhang, Yaoyao Xu, Xiaozhuang Song, Shu Wu, Tianshu Yu</p>

            <p><strong>Title:</strong><br>
            The Underappreciated Power of Vision Models for Graph Structural Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.24788v1">http://arxiv.org/abs/2510.24788v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph Neural Networks operate through bottom-up message-passing, fundamentally differing from human visual perception, which intuitively captures global structures first. We investigate the underappreciated potential of vision models for graph understanding, finding they achieve performance comparable to GNNs on established benchmarks while exhibiting distinctly different learning patterns. These divergent behaviors, combined with limitations of existing benchmarks that conflate domain features with topological understanding, motivate our introduction of GraphAbstract. This benchmark evaluates models' ability to perceive global graph properties as humans do: recognizing organizational archetypes, detecting symmetry, sensing connectivity strength, and identifying critical elements. Our results reveal that vision models significantly outperform GNNs on tasks requiring holistic structural understanding and maintain generalizability across varying graph scales, while GNNs struggle with global pattern abstraction and degrade with increasing graph size. This work demonstrates that vision models possess remarkable yet underutilized capabilities for graph structural understanding, particularly for problems requiring global topological awareness and scale-invariant reasoning. These findings open new avenues to leverage this underappreciated potential for developing more effective graph foundation models for tasks dominated by holistic pattern recognition.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback</title>
      <itunes:episode>1345</itunes:episode>
      <podcast:episode>1345</podcast:episode>
      <itunes:title>UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">888cefda-191b-4827-aef8-74a951bf13a5</guid>
      <link>https://share.transistor.fm/s/96a2b3f3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang</p>

            <p><strong>Title:</strong><br>
            UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01678v1">http://arxiv.org/abs/2511.01678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang</p>

            <p><strong>Title:</strong><br>
            UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01678v1">http://arxiv.org/abs/2511.01678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:59:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/96a2b3f3/b10d3881.mp3" length="23077304" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1439</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang</p>

            <p><strong>Title:</strong><br>
            UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01678v1">http://arxiv.org/abs/2511.01678v1</a></p>

            <p><strong>Abstract:</strong><br>
            Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation</title>
      <itunes:episode>1344</itunes:episode>
      <podcast:episode>1344</podcast:episode>
      <itunes:title>ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">db9de597-076c-4925-80f6-7d8ae5ea6b4f</guid>
      <link>https://share.transistor.fm/s/89a829f8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang</p>

            <p><strong>Title:</strong><br>
            ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01163v1">http://arxiv.org/abs/2511.01163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang</p>

            <p><strong>Title:</strong><br>
            ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01163v1">http://arxiv.org/abs/2511.01163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:58:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/89a829f8/8f7be2c0.mp3" length="24602847" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1534</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang</p>

            <p><strong>Title:</strong><br>
            ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01163v1">http://arxiv.org/abs/2511.01163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PHUMA: Physically-Grounded Humanoid Locomotion Dataset</title>
      <itunes:episode>1343</itunes:episode>
      <podcast:episode>1343</podcast:episode>
      <itunes:title>PHUMA: Physically-Grounded Humanoid Locomotion Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">57a6b1ea-a2de-4bfe-aa8c-ccf81dad848c</guid>
      <link>https://share.transistor.fm/s/5409a9e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            PHUMA: Physically-Grounded Humanoid Locomotion Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26236v1">http://arxiv.org/abs/2510.26236v1</a></p>

            <p><strong>Abstract:</strong><br>
            Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            PHUMA: Physically-Grounded Humanoid Locomotion Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26236v1">http://arxiv.org/abs/2510.26236v1</a></p>

            <p><strong>Abstract:</strong><br>
            Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:58:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5409a9e2/273bab04.mp3" length="20628451" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1286</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Kyungmin Lee, Sibeen Kim, Minho Park, Hyunseung Kim, Dongyoon Hwang, Hojoon Lee, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            PHUMA: Physically-Grounded Humanoid Locomotion Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26236v1">http://arxiv.org/abs/2510.26236v1</a></p>

            <p><strong>Abstract:</strong><br>
            Motion imitation is a promising approach for humanoid locomotion, enabling agents to acquire humanlike behaviors. Existing methods typically rely on high-quality motion capture datasets such as AMASS, but these are scarce and expensive, limiting scalability and diversity. Recent studies attempt to scale data collection by converting large-scale internet videos, exemplified by Humanoid-X. However, they often introduce physical artifacts such as floating, penetration, and foot skating, which hinder stable imitation. In response, we introduce PHUMA, a Physically-grounded HUMAnoid locomotion dataset that leverages human video at scale, while addressing physical artifacts through careful data curation and physics-constrained retargeting. PHUMA enforces joint limits, ensures ground contact, and eliminates foot skating, producing motions that are both large-scale and physically reliable. We evaluated PHUMA in two sets of conditions: (i) imitation of unseen motion from self-recorded test videos and (ii) path following with pelvis-only guidance. In both cases, PHUMA-trained policies outperform Humanoid-X and AMASS, achieving significant gains in imitating diverse motions. The code is available at https://davian-robotics.github.io/PHUMA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniREditBench: A Unified Reasoning-based Image Editing Benchmark</title>
      <itunes:episode>1342</itunes:episode>
      <podcast:episode>1342</podcast:episode>
      <itunes:title>UniREditBench: A Unified Reasoning-based Image Editing Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">063c7e3c-a3fc-4211-a01c-449f8a1a25e4</guid>
      <link>https://share.transistor.fm/s/436b4717</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniREditBench: A Unified Reasoning-based Image Editing Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01295v1">http://arxiv.org/abs/2511.01295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniREditBench: A Unified Reasoning-based Image Editing Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01295v1">http://arxiv.org/abs/2511.01295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:57:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/436b4717/7c82526b.mp3" length="22155683" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1381</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniREditBench: A Unified Reasoning-based Image Editing Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.01295v1">http://arxiv.org/abs/2511.01295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>World Simulation with Video Foundation Models for Physical AI</title>
      <itunes:episode>1341</itunes:episode>
      <podcast:episode>1341</podcast:episode>
      <itunes:title>World Simulation with Video Foundation Models for Physical AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b8581426-3d23-4e53-b49f-e71b919f066e</guid>
      <link>https://share.transistor.fm/s/922fbdb7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu</p>

            <p><strong>Title:</strong><br>
            World Simulation with Video Foundation Models for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00062v1">http://arxiv.org/abs/2511.00062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu</p>

            <p><strong>Title:</strong><br>
            World Simulation with Video Foundation Models for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00062v1">http://arxiv.org/abs/2511.00062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Nov 2025 19:57:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/922fbdb7/988f6884.mp3" length="27914740" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1741</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu</p>

            <p><strong>Title:</strong><br>
            World Simulation with Video Foundation Models for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2511.00062v1">http://arxiv.org/abs/2511.00062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning</title>
      <itunes:episode>1340</itunes:episode>
      <podcast:episode>1340</podcast:episode>
      <itunes:title>ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2a0a0ab-1b11-431e-8887-eb1662ee22db</guid>
      <link>https://share.transistor.fm/s/1ad4cfab</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27492v1">http://arxiv.org/abs/2510.27492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27492v1">http://arxiv.org/abs/2510.27492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Nov 2025 19:10:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1ad4cfab/01035037.mp3" length="21949232" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27492v1">http://arxiv.org/abs/2510.27492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats</title>
      <itunes:episode>1339</itunes:episode>
      <podcast:episode>1339</podcast:episode>
      <itunes:title>INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6d93add7-6264-43ef-aaa3-d4c8c740460f</guid>
      <link>https://share.transistor.fm/s/fe702533</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25602v1">http://arxiv.org/abs/2510.25602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25602v1">http://arxiv.org/abs/2510.25602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Nov 2025 19:10:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fe702533/aaabee2a.mp3" length="20124417" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, Ping Luo</p>

            <p><strong>Title:</strong><br>
            INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.25602v1">http://arxiv.org/abs/2510.25602v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper fills that gap by systematically investigating the trade-offs between FP and INT formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced. Our comprehensive comparison demonstrates that for popular 8-bit fine-grained formats (e.g., MX with block size 32), MXINT8 is superior to its FP counterpart in both algorithmic accuracy and hardware efficiency. However, for 4-bit formats, FP (e.g., MXFP4, NVFP4) often holds an accuracy advantage, though we show that NVINT4 can surpass NVFP4 when outlier-mitigation techniques like Hadamard rotation are applied. We also introduce a symmetric clipping method that resolves gradient bias in fine-grained low-bit INT training, enabling nearly lossless performance for MXINT8 training. These findings challenge the current hardware trajectory, demonstrating that a one-size-fits-all FP approach is suboptimal and advocating that fine-grained INT formats, particularly MXINT8, offer a better balance of accuracy, power, and efficiency for future AI accelerators.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning</title>
      <itunes:episode>1338</itunes:episode>
      <podcast:episode>1338</podcast:episode>
      <itunes:title>Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">06448783-51e6-4875-885c-83cfd50d2f3a</guid>
      <link>https://share.transistor.fm/s/d6651908</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27606v1">http://arxiv.org/abs/2510.27606v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.</p>
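
            <p><strong>Illustrative code sketch (ours, not the authors'):</strong> the key idea is that each pretext task carries an answer that can be checked mechanically, so the reinforcement signal needs no human or LVLM annotation. The toy example below builds one "shuffled patch reordering" instance and its binary verifiable reward; the function names and grid size are our own assumptions.</p>

            <pre><code>import numpy as np

def make_patch_reordering_task(image, grid=2, rng=None):
    """Build one shuffled-patch-reordering example: split an HxWxC image into
    a grid of patches, shuffle them, and keep the permutation as the
    ground-truth answer the model must recover. H and W must be divisible by grid."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[0] // grid, image.shape[1] // grid
    patches = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
               for i in range(grid) for j in range(grid)]
    perm = rng.permutation(len(patches))
    shuffled = [patches[p] for p in perm]
    return shuffled, perm

def verifiable_reward(predicted_perm, true_perm):
    """Binary reward: 1.0 only if the predicted ordering matches exactly."""
    return float(np.array_equal(predicted_perm, true_perm))

image = np.zeros((64, 64, 3), dtype=np.uint8)
shuffled, answer = make_patch_reordering_task(image)
print(verifiable_reward(answer, answer))  # 1.0</code></pre>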
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27606v1">http://arxiv.org/abs/2510.27606v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Nov 2025 19:09:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6651908/7f01ca89.mp3" length="24485830" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1527</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.27606v1">http://arxiv.org/abs/2510.27606v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The End of Manual Decoding: Towards Truly End-to-End Language Models</title>
      <itunes:episode>1337</itunes:episode>
      <podcast:episode>1337</podcast:episode>
      <itunes:title>The End of Manual Decoding: Towards Truly End-to-End Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">992a57dc-c42b-44a1-b5f1-b3c3db567117</guid>
      <link>https://share.transistor.fm/s/6ccf887b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang</p>

            <p><strong>Title:</strong><br>
            The End of Manual Decoding: Towards Truly End-to-End Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26697v1">http://arxiv.org/abs/2510.26697v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.</p>
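
            <p><strong>Illustrative code sketch (ours, not the authors'):</strong> AutoDeco is described as predicting a temperature and a top-p value at every step alongside the logits. The snippet below shows what one decoding step looks like once such per-token controls are available; the sampling routine is a generic temperature plus nucleus filter, and the control values here are stand-ins for what the paper's lightweight heads would emit.</p>

            <pre><code>import torch
import torch.nn.functional as F

def sample_step(logits, temperature, top_p):
    """One decoding step with per-step temperature and top-p, the quantities
    AutoDeco's control heads are described as predicting. logits has shape
    (vocab,); temperature and top_p are scalars for this step."""
    probs = F.softmax(logits / max(float(temperature), 1e-3), dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest prefix of tokens whose mass reaches top_p (always keep the best one).
    keep = cumulative - sorted_probs < top_p
    keep[0] = True
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()
    return sorted_idx[torch.multinomial(filtered, 1)].item()

logits = torch.randn(32000)
# In AutoDeco these two values would come from the model's heads at this step.
next_token = sample_step(logits, temperature=0.7, top_p=0.9)</code></pre>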
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang</p>

            <p><strong>Title:</strong><br>
            The End of Manual Decoding: Towards Truly End-to-End Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26697v1">http://arxiv.org/abs/2510.26697v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Oct 2025 20:35:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6ccf887b/434d0634.mp3" length="21726025" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang</p>

            <p><strong>Title:</strong><br>
            The End of Manual Decoding: Towards Truly End-to-End Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26697v1">http://arxiv.org/abs/2510.26697v1</a></p>

            <p><strong>Abstract:</strong><br>
            The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kimi Linear: An Expressive, Efficient Attention Architecture</title>
      <itunes:episode>1336</itunes:episode>
      <podcast:episode>1336</podcast:episode>
      <itunes:title>Kimi Linear: An Expressive, Efficient Attention Architecture</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1fd54435-33c6-443a-8a13-737304ff3dbe</guid>
      <link>https://share.transistor.fm/s/a9f8272e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du</p>

            <p><strong>Title:</strong><br>
            Kimi Linear: An Expressive, Efficient Attention Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26692v1">http://arxiv.org/abs/2510.26692v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.</p>
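
            <p><strong>Illustrative code sketch (ours, not the authors'):</strong> KDA extends the delta-rule family of linear attention, in which a fixed-size state matrix is updated by writing the error between the value it currently stores for a key and the incoming value. The minimal recurrence below shows the plain delta rule for a single head; it omits KDA's finer-grained gating and the chunkwise DPLR kernel described above, and the names and shapes are our own.</p>

            <pre><code>import torch

def delta_rule_step(S, k, v, q, beta=0.5):
    """One recurrent step of a plain delta-rule linear attention head, the
    family that Kimi Delta Attention extends. S is a (d_k, d_v) state matrix;
    k and q are (d_k,) vectors, v is (d_v,); beta is the write strength."""
    v_old = S.T @ k                             # value the state currently associates with k
    S = S + torch.outer(k, beta * (v - v_old))  # correct the memory toward the new value
    out = S.T @ q                               # read the updated state with the query
    return S, out

d_k, d_v = 8, 8
S = torch.zeros(d_k, d_v)
for _ in range(4):
    k, v, q = torch.randn(d_k), torch.randn(d_v), torch.randn(d_k)
    S, y = delta_rule_step(S, k, v, q)</code></pre>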
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du</p>

            <p><strong>Title:</strong><br>
            Kimi Linear: An Expressive, Efficient Attention Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26692v1">http://arxiv.org/abs/2510.26692v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Oct 2025 20:35:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9f8272e/5ff1b244.mp3" length="22315758" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du</p>

            <p><strong>Title:</strong><br>
            Kimi Linear: An Expressive, Efficient Attention Architecture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26692v1">http://arxiv.org/abs/2510.26692v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Surfer 2: The Next Generation of Cross-Platform Computer Use Agents</title>
      <itunes:episode>1335</itunes:episode>
      <podcast:episode>1335</podcast:episode>
      <itunes:title>Surfer 2: The Next Generation of Cross-Platform Computer Use Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6c420c54-4825-472a-a46e-9957f2b45df1</guid>
      <link>https://share.transistor.fm/s/742bf9d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij</p>

            <p><strong>Title:</strong><br>
            Surfer 2: The Next Generation of Cross-Platform Computer Use Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19949v2">http://arxiv.org/abs/2510.19949v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij</p>

            <p><strong>Title:</strong><br>
            Surfer 2: The Next Generation of Cross-Platform Computer Use Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19949v2">http://arxiv.org/abs/2510.19949v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Oct 2025 20:34:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/742bf9d8/3f11c1fe.mp3" length="22857858" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1425</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij</p>

            <p><strong>Title:</strong><br>
            Surfer 2: The Next Generation of Cross-Platform Computer Use Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19949v2">http://arxiv.org/abs/2510.19949v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark</title>
      <itunes:episode>1334</itunes:episode>
      <podcast:episode>1334</podcast:episode>
      <itunes:title>Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">96d56fac-1337-4c14-ba27-dd735bb21bcc</guid>
      <link>https://share.transistor.fm/s/aa062693</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26802v1">http://arxiv.org/abs/2510.26802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26802v1">http://arxiv.org/abs/2510.26802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Oct 2025 20:34:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa062693/7f41e9eb.mp3" length="24134748" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1505</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26802v1">http://arxiv.org/abs/2510.26802v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Quest for Generalizable Motion Generation: Data, Model, and Evaluation</title>
      <itunes:episode>1333</itunes:episode>
      <podcast:episode>1333</podcast:episode>
      <itunes:title>The Quest for Generalizable Motion Generation: Data, Model, and Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f950a60b-966c-4444-9611-a276fb062fc3</guid>
      <link>https://share.transistor.fm/s/a7eac950</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Quest for Generalizable Motion Generation: Data, Model, and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26794v1">http://arxiv.org/abs/2510.26794v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.</p>
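
            <p><strong>Illustrative code sketch (ours, not the authors'):</strong> ViMoGen is described as a flow-matching-based diffusion transformer. The snippet below shows the generic conditional flow-matching objective such a model would be trained with: interpolate between noise and a data sample, then regress the predicted velocity toward their difference. The model signature, shapes, and conditioning are our own assumptions, not the paper's implementation.</p>

            <pre><code>import torch

def flow_matching_loss(model, x1, cond=None):
    """Generic conditional flow-matching objective: sample t, interpolate
    between Gaussian noise x0 and a data sample x1, and regress the model's
    predicted velocity toward (x1 - x0). model(x_t, t, cond) is assumed to
    return a tensor shaped like x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return ((model(x_t, t, cond) - target_velocity) ** 2).mean()

# Stand-in network just to make the sketch runnable; a real model would be a
# diffusion transformer conditioned on text (and optionally video) features.
dummy_model = lambda x_t, t, cond: torch.zeros_like(x_t)
motion_batch = torch.randn(8, 16, 6)  # (batch, frames, pose dims), shapes are illustrative
print(flow_matching_loss(dummy_model, motion_batch))</code></pre>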
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Quest for Generalizable Motion Generation: Data, Model, and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26794v1">http://arxiv.org/abs/2510.26794v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Oct 2025 20:33:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7eac950/f8bae440.mp3" length="21954655" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            The Quest for Generalizable Motion Generation: Data, Model, and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.26794v1">http://arxiv.org/abs/2510.26794v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations</title>
      <itunes:episode>1332</itunes:episode>
      <podcast:episode>1332</podcast:episode>
      <itunes:title>Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">29f6cf75-76c4-4a24-9026-fd0c014ce562</guid>
      <link>https://share.transistor.fm/s/c63249a5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.23607v1">http://arxiv.org/abs/2510.23607v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.</p>
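
            <p><strong>Illustrative code sketch (ours, not the authors'):</strong> the cross-modal half of Concerto pulls the feature of a 3D point toward the feature of its corresponding 2D pixel in a joint embedding space. The snippet below shows a generic InfoNCE version of such a 2D-3D alignment term; the loss form, names, and temperature are our own assumptions, not the paper's exact objective.</p>

            <pre><code>import torch
import torch.nn.functional as F

def joint_embedding_loss(point_feats, pixel_feats, tau=0.07):
    """Generic 2D-3D joint-embedding term: pull each 3D point feature toward
    the feature of its paired image pixel with an InfoNCE loss over the batch.
    point_feats and pixel_feats are (N, D) with row i of each forming a pair."""
    z3 = F.normalize(point_feats, dim=-1)
    z2 = F.normalize(pixel_feats, dim=-1)
    logits = z3 @ z2.T / tau            # (N, N) cosine similarities
    targets = torch.arange(z3.size(0))  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

print(joint_embedding_loss(torch.randn(16, 64), torch.randn(16, 64)))</code></pre>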
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.23607v1">http://arxiv.org/abs/2510.23607v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Oct 2025 20:12:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c63249a5/b4fe81ee.mp3" length="22286937" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1389</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 147 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.23607v1">http://arxiv.org/abs/2510.23607v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning</title>
      <itunes:episode>1331</itunes:episode>
      <podcast:episode>1331</podcast:episode>
      <itunes:title>Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6cca5f52-f959-4301-b33c-a014c586d52d</guid>
      <link>https://share.transistor.fm/s/1ed72552</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19338v2">http://arxiv.org/abs/2510.19338v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19338v2">http://arxiv.org/abs/2510.19338v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Oct 2025 20:35:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1ed72552/3d20533a.mp3" length="22238877" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1386</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19338v2">http://arxiv.org/abs/2510.19338v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping</title>
      <itunes:episode>1330</itunes:episode>
      <podcast:episode>1330</podcast:episode>
      <itunes:title>BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b373788e-88a2-4f08-8891-e51227c56953</guid>
      <link>https://share.transistor.fm/s/141e752f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18927v1">http://arxiv.org/abs/2510.18927v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.</p>
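
            <p><strong>Illustrative sketch:</strong> a schematic of a PPO-style clipped surrogate with separate, adaptively adjusted lower and upper clipping bounds, the core idea of re-balancing positive- and negative-advantage contributions. The adaptation rule below is a toy heuristic, not the BAPO update; all names and thresholds are assumptions.</p>

            <pre><code>import numpy as np

def clipped_surrogate(ratio, adv, clip_low, clip_high):
    """PPO-style clipped objective with asymmetric lower/upper bounds.
    ratio: pi_theta / pi_old per token; adv: advantage estimates."""
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return np.minimum(ratio * adv, clipped * adv).mean()

def adapt_clip_bounds(adv, clip_low, clip_high, target_pos_frac=0.5, step=0.01):
    """Toy adaptive rule: if negative-advantage tokens dominate the batch,
    widen the bound that admits positive contributions and tighten the other,
    re-balancing the gradient (illustrative only)."""
    pos_frac = float((adv > 0).mean())
    if target_pos_frac > pos_frac:          # negatives dominate this batch
        clip_high += step                   # admit more positive-side updates
        clip_low = max(clip_low - step, 0.05)
    else:
        clip_high = max(clip_high - step, 0.05)
        clip_low += step
    return clip_low, clip_high

rng = np.random.default_rng(0)
ratio, adv = np.exp(rng.normal(0.0, 0.2, 4096)), rng.normal(0.0, 1.0, 4096)
c_lo, c_hi = 0.2, 0.2
loss = clipped_surrogate(ratio, adv, c_lo, c_hi)
c_lo, c_hi = adapt_clip_bounds(adv, c_lo, c_hi)
</code></pre>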
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18927v1">http://arxiv.org/abs/2510.18927v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Oct 2025 20:35:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/141e752f/3d35bdd6.mp3" length="20861316" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18927v1">http://arxiv.org/abs/2510.18927v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings--where stale data from past policies are used for training--improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios--including sample replay and partial rollout--BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts</title>
      <itunes:episode>1329</itunes:episode>
      <podcast:episode>1329</podcast:episode>
      <itunes:title>LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">42b136f0-8036-4b0b-95ff-960ff38606db</guid>
      <link>https://share.transistor.fm/s/dbb66ed5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19363v1">http://arxiv.org/abs/2510.19363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.</p>
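
            <p><strong>Illustrative sketch:</strong> a toy rendition of KeyChain-style task synthesis as described above: a chain of UUID hops is scattered among distracting documents, and the last hop reveals the true question. Document wording, key format, and lengths are illustrative assumptions, not the paper's data format.</p>

            <pre><code>import random, uuid

def keychain_task(question, answer, distractors, chain_len=4, seed=0):
    """Hide a short QA pair behind a chain of UUID hops among distractor docs."""
    rng = random.Random(seed)
    keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(chain_len)]
    docs = list(distractors)
    # each hop points to the next key; the final hop reveals the real question
    for i, key in enumerate(keys):
        if i + 1 == chain_len:
            docs.append(f"Key {key}: the question you must answer is: {question}")
        else:
            docs.append(f"Key {key}: go to key {keys[i + 1]}")
    rng.shuffle(docs)
    prompt = "\n\n".join(docs) + f"\n\nStart from key {keys[0]} and answer the hidden question."
    return {"prompt": prompt, "answer": answer}

task = keychain_task(
    question="Which city hosted the 2008 Summer Olympics?",
    answer="Beijing",
    distractors=[f"Filler document {i}: unrelated facts." for i in range(50)],
)
</code></pre>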
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19363v1">http://arxiv.org/abs/2510.19363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Oct 2025 20:34:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dbb66ed5/242099f2.mp3" length="20588344" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19363v1">http://arxiv.org/abs/2510.19363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Language Models are Injective and Hence Invertible</title>
      <itunes:episode>1328</itunes:episode>
      <podcast:episode>1328</podcast:episode>
      <itunes:title>Language Models are Injective and Hence Invertible</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e74755e1-d900-4279-a8f9-b4ae684948b6</guid>
      <link>https://share.transistor.fm/s/4dabe3b5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà</p>

            <p><strong>Title:</strong><br>
            Language Models are Injective and Hence Invertible</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15511v3">http://arxiv.org/abs/2510.15511v3</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.</p>
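
            <p><strong>Illustrative sketch:</strong> the empirical part of the claim amounts to checking that distinct prompts never map to identical hidden representations. A minimal harness for such a collision test is sketched below; the encoder is a deterministic toy stand-in, not a transformer, and all names are assumptions.</p>

            <pre><code>import hashlib
import numpy as np

def collision_test(encode, prompts, decimals=6):
    """Report any pair of distinct prompts whose representations coincide.
    `encode` is any text-to-vector function (a real LM hidden state in the
    paper; a toy stand-in here). Rounding avoids spurious float mismatches."""
    seen, collisions = {}, []
    for p in prompts:
        key = hashlib.sha256(np.round(encode(p), decimals).tobytes()).hexdigest()
        if key in seen and seen[key] != p:
            collisions.append((seen[key], p))
        seen[key] = p
    return collisions

def toy_encode(text, dim=32):
    """Deterministic pseudo-embedding used only to exercise the harness."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).normal(size=dim)

prompts = [f"prompt number {i}" for i in range(10000)]
print("collisions:", collision_test(toy_encode, prompts))
</code></pre>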
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà</p>

            <p><strong>Title:</strong><br>
            Language Models are Injective and Hence Invertible</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15511v3">http://arxiv.org/abs/2510.15511v3</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Oct 2025 20:34:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4dabe3b5/5cec1050.mp3" length="23154174" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1443</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, Emanuele Rodolà</p>

            <p><strong>Title:</strong><br>
            Language Models are Injective and Hence Invertible</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15511v3">http://arxiv.org/abs/2510.15511v3</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GigaBrain-0: A World Model-Powered Vision-Language-Action Model</title>
      <itunes:episode>1327</itunes:episode>
      <podcast:episode>1327</podcast:episode>
      <itunes:title>GigaBrain-0: A World Model-Powered Vision-Language-Action Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56638d9c-840d-41a3-8242-502cb61381e5</guid>
      <link>https://share.transistor.fm/s/e36164a6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaBrain-0: A World Model-Powered Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19430v1">http://arxiv.org/abs/2510.19430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaBrain-0: A World Model-Powered Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19430v1">http://arxiv.org/abs/2510.19430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Oct 2025 20:33:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e36164a6/782a2568.mp3" length="27960300" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1744</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu</p>

            <p><strong>Title:</strong><br>
            GigaBrain-0: A World Model-Powered Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.19430v1">http://arxiv.org/abs/2510.19430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LightMem: Lightweight and Efficient Memory-Augmented Generation</title>
      <itunes:episode>1326</itunes:episode>
      <podcast:episode>1326</podcast:episode>
      <itunes:title>LightMem: Lightweight and Efficient Memory-Augmented Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b8f0d43-2073-4270-8802-55618d512cd4</guid>
      <link>https://share.transistor.fm/s/72d6c308</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightMem: Lightweight and Efficient Memory-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18866v1">http://arxiv.org/abs/2510.18866v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.</p>
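
            <p><strong>Illustrative sketch:</strong> a schematic of the three-stage flow described above (sensory filtering, then topic-grouped short-term memory, then offline sleep-time consolidation). The class name, salience proxy, and summarizer below are placeholders, not LightMem's implementation.</p>

            <pre><code>from collections import defaultdict

class ThreeStageMemorySketch:
    """Toy three-stage memory in the Atkinson-Shiffrin-inspired style."""

    def __init__(self, keep_ratio=0.3):
        self.keep_ratio = keep_ratio
        self.short_term = defaultdict(list)   # topic -> filtered utterances
        self.long_term = {}                   # topic -> consolidated summary

    def sensory_filter(self, turns):
        """Stage 1: cheap compression -- keep only the most salient turns."""
        scored = sorted(turns, key=len, reverse=True)   # length as a toy salience proxy
        keep = max(1, int(len(turns) * self.keep_ratio))
        return scored[:keep]

    def ingest(self, turns, topic_of):
        """Stage 2: group filtered content by topic into short-term memory."""
        for turn in self.sensory_filter(turns):
            self.short_term[topic_of(turn)].append(turn)

    def sleep_time_update(self, summarize):
        """Stage 3: offline consolidation, decoupled from online inference."""
        for topic, items in self.short_term.items():
            self.long_term[topic] = summarize(items)
        self.short_term.clear()

mem = ThreeStageMemorySketch()
mem.ingest(["we met in Paris in 2019", "ok", "the budget is 40k"],
           topic_of=lambda t: "travel" if "Paris" in t else "finance")
mem.sleep_time_update(summarize=lambda items: " | ".join(items))
</code></pre>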
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightMem: Lightweight and Efficient Memory-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18866v1">http://arxiv.org/abs/2510.18866v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 21:01:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/72d6c308/9647bde8.mp3" length="25050886" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1562</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang</p>

            <p><strong>Title:</strong><br>
            LightMem: Lightweight and Efficient Memory-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18866v1">http://arxiv.org/abs/2510.18866v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. Experiments on LongMemEval with GPT and Qwen backbones show that LightMem outperforms strong baselines in accuracy (up to 10.9% gains) while reducing token usage by up to 117x, API calls by up to 159x, and runtime by over 12x. The code is available at https://github.com/zjunlp/LightMem.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Long-context Language Model Training by Core Attention Disaggregation</title>
      <itunes:episode>1325</itunes:episode>
      <podcast:episode>1325</podcast:episode>
      <itunes:title>Efficient Long-context Language Model Training by Core Attention Disaggregation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">656417b2-0873-40cf-8917-989fc502d7ff</guid>
      <link>https://share.transistor.fm/s/fe3b7dc9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.LG, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Long-context Language Model Training by Core Attention Disaggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18121v1">http://arxiv.org/abs/2510.18121v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.</p>
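
            <p><strong>Illustrative sketch:</strong> the load-balancing idea can be pictured as a scheduler that models each token-level attention shard's cost as q_len * kv_len (the quadratic softmax(QK^T)V work) and assigns shards to the least-loaded attention server. The greedy balancer below is a toy illustration under those assumptions, not DistCA's dynamic rebatcher.</p>

            <pre><code>import heapq

def rebalance_attention_tasks(tasks, num_servers):
    """Greedy longest-processing-time assignment of attention shards.
    Each task is (q_len, kv_len); its modeled cost is q_len * kv_len."""
    tasks = sorted(tasks, key=lambda t: t[0] * t[1], reverse=True)
    heap = [(0, s, []) for s in range(num_servers)]   # (load, server_id, batch)
    heapq.heapify(heap)
    for q_len, kv_len in tasks:
        load, sid, batch = heapq.heappop(heap)        # least-loaded server
        batch.append((q_len, kv_len))
        heapq.heappush(heap, (load + q_len * kv_len, sid, batch))
    return sorted(heap, key=lambda x: x[1])

# variable-length shards from different sequences in a long-context batch
shards = [(4096, 4096), (4096, 131072), (1024, 65536), (2048, 2048), (512, 524288)]
for load, sid, batch in rebalance_attention_tasks(shards, num_servers=3):
    print(f"server {sid}: load={load}, shards={batch}")
</code></pre>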
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.LG, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Long-context Language Model Training by Core Attention Disaggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18121v1">http://arxiv.org/abs/2510.18121v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 21:00:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fe3b7dc9/a786c272.mp3" length="22790996" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.LG, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Long-context Language Model Training by Core Attention Disaggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18121v1">http://arxiv.org/abs/2510.18121v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>World-in-World: World Models in a Closed-Loop World</title>
      <itunes:episode>1324</itunes:episode>
      <podcast:episode>1324</podcast:episode>
      <itunes:title>World-in-World: World Models in a Closed-Loop World</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d0d52292-11b3-48f8-a7b0-c687261a3bb2</guid>
      <link>https://share.transistor.fm/s/b101e8bd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            World-in-World: World Models in a Closed-Loop World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18135v1">http://arxiv.org/abs/2510.18135v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.</p>
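
            <p><strong>Illustrative sketch:</strong> a toy closed-loop protocol in the sense used above: a world model scores candidate actions, the chosen action is executed in the real environment, and the next real observation (not a generated one) drives the next step, with task success as the metric. The environment, world model, and interfaces below are stand-ins, not the World-in-World API.</p>

            <pre><code>class ToyEnv:
    """Stand-in environment: reach state 3 within the step budget."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += action
        done = self.state >= 3
        return self.state, done, done   # obs, done, success

class ToyWorldModel:
    """Stand-in world model: imagines how good taking `action` from `obs` would be."""
    def predict_return(self, obs, action):
        return -(abs(3 - (obs + action)))   # closer to the goal is better

def closed_loop_eval(env, wm, candidate_actions=(0, 1, 2), max_steps=10):
    """Score candidates with the world model, act in the real env, repeat."""
    obs = env.reset()
    for _ in range(max_steps):
        action = max(candidate_actions, key=lambda a: wm.predict_return(obs, a))
        obs, done, success = env.step(action)   # ground-truth transition
        if done:
            return success
    return False

print("success:", closed_loop_eval(ToyEnv(), ToyWorldModel()))
</code></pre>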
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            World-in-World: World Models in a Closed-Loop World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18135v1">http://arxiv.org/abs/2510.18135v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 21:00:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b101e8bd/e3cac03a.mp3" length="23549982" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            World-in-World: World Models in a Closed-Loop World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18135v1">http://arxiv.org/abs/2510.18135v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success, controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation</title>
      <itunes:episode>1323</itunes:episode>
      <podcast:episode>1323</podcast:episode>
      <itunes:title>UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a043ee7-add7-40fd-b04e-f32fcd8f54d9</guid>
      <link>https://share.transistor.fm/s/e26bc761</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18701v1">http://arxiv.org/abs/2510.18701v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack the diversity of prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) spans across diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, an effective pipeline is developed for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-sourced T2I models, we systematically reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18701v1">http://arxiv.org/abs/2510.18701v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack diversity in prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:59:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e26bc761/0458db41.mp3" length="23254099" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1450</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18701v1">http://arxiv.org/abs/2510.18701v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks in evaluating how accurately generated images reflect the semantics of their textual prompt. However, (1) existing benchmarks lack diversity in prompt scenarios and multilingual support, both essential for real-world applicability; (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions, and fall short in fine-grained sub-dimension assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. Specifically, it comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) it spans diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; (2) it comprehensively probes T2I models' semantic consistency over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide both English and Chinese versions of each prompt in short and long forms. Leveraging the general world knowledge and fine-grained image understanding capabilities of a closed-source Multi-modal Large Language Model (MLLM), i.e., Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chem-R: Learning to Reason as a Chemist</title>
      <itunes:episode>1322</itunes:episode>
      <podcast:episode>1322</podcast:episode>
      <itunes:title>Chem-R: Learning to Reason as a Chemist</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89e405b4-ee9e-4d27-b793-a37bf7494808</guid>
      <link>https://share.transistor.fm/s/a24323be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Benteng Chen, Di Zhang, Wanhao Liu, Shuchen Pu, Ben Gao, Jin Zeng, Xiaoyong Wei, Tianshu Yu, Shuzhou Sun, Tianfan Fu, Wanli Ouyang, Lei Bai, Jiatong Li, Zifu Wang, Yuqiang Li, Shufei Zhang</p>

            <p><strong>Title:</strong><br>
            Chem-R: Learning to Reason as a Chemist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16880v2">http://arxiv.org/abs/2510.16880v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although large language models (LLMs) have significant potential to advance chemical discovery, current LLMs lack core chemical knowledge, produce unreliable reasoning trajectories, and exhibit suboptimal performance across diverse chemical tasks. To address these challenges, we propose Chem-R, a generalizable Chemical Reasoning model designed to emulate the deliberative processes of chemists. Chem-R is trained through a three-phase framework that progressively builds advanced reasoning capabilities, including: 1) Chemical Foundation Training, which establishes core chemical knowledge. 2) Chemical Reasoning Protocol Distillation, incorporating structured, expert-like reasoning traces to guide systematic and reliable problem solving. 3) Multi-task Group Relative Policy Optimization that optimizes the model for balanced performance across diverse molecular- and reaction-level tasks. This structured pipeline enables Chem-R to achieve state-of-the-art performance on comprehensive benchmarks, surpassing leading large language models, including Gemini-2.5-Pro and DeepSeek-R1, by up to 32% on molecular tasks and 48% on reaction tasks. Meanwhile, Chem-R also consistently outperforms the existing chemical foundation models across both molecular and reaction level tasks. These results highlight Chem-R's robust generalization, interpretability, and potential as a foundation for next-generation AI-driven chemical discovery. The code and model are available at https://github.com/davidweidawang/Chem-R.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Benteng Chen, Di Zhang, Wanhao Liu, Shuchen Pu, Ben Gao, Jin Zeng, Xiaoyong Wei, Tianshu Yu, Shuzhou Sun, Tianfan Fu, Wanli Ouyang, Lei Bai, Jiatong Li, Zifu Wang, Yuqiang Li, Shufei Zhang</p>

            <p><strong>Title:</strong><br>
            Chem-R: Learning to Reason as a Chemist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16880v2">http://arxiv.org/abs/2510.16880v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although large language models (LLMs) have significant potential to advance chemical discovery, current LLMs lack core chemical knowledge, produce unreliable reasoning trajectories, and exhibit suboptimal performance across diverse chemical tasks. To address these challenges, we propose Chem-R, a generalizable Chemical Reasoning model designed to emulate the deliberative processes of chemists. Chem-R is trained through a three-phase framework that progressively builds advanced reasoning capabilities, including: 1) Chemical Foundation Training, which establishes core chemical knowledge. 2) Chemical Reasoning Protocol Distillation, incorporating structured, expert-like reasoning traces to guide systematic and reliable problem solving. 3) Multi-task Group Relative Policy Optimization that optimizes the model for balanced performance across diverse molecular- and reaction-level tasks. This structured pipeline enables Chem-R to achieve state-of-the-art performance on comprehensive benchmarks, surpassing leading large language models, including Gemini-2.5-Pro and DeepSeek-R1, by up to 32% on molecular tasks and 48% on reaction tasks. Meanwhile, Chem-R also consistently outperforms the existing chemical foundation models across both molecular and reaction level tasks. These results highlight Chem-R's robust generalization, interpretability, and potential as a foundation for next-generation AI-driven chemical discovery. The code and model are available at https://github.com/davidweidawang/Chem-R.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:59:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a24323be/715cc02f.mp3" length="19987704" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1246</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Benteng Chen, Di Zhang, Wanhao Liu, Shuchen Pu, Ben Gao, Jin Zeng, Xiaoyong Wei, Tianshu Yu, Shuzhou Sun, Tianfan Fu, Wanli Ouyang, Lei Bai, Jiatong Li, Zifu Wang, Yuqiang Li, Shufei Zhang</p>

            <p><strong>Title:</strong><br>
            Chem-R: Learning to Reason as a Chemist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16880v2">http://arxiv.org/abs/2510.16880v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although large language models (LLMs) have significant potential to advance chemical discovery, current LLMs lack core chemical knowledge, produce unreliable reasoning trajectories, and exhibit suboptimal performance across diverse chemical tasks. To address these challenges, we propose Chem-R, a generalizable Chemical Reasoning model designed to emulate the deliberative processes of chemists. Chem-R is trained through a three-phase framework that progressively builds advanced reasoning capabilities, including: 1) Chemical Foundation Training, which establishes core chemical knowledge. 2) Chemical Reasoning Protocol Distillation, incorporating structured, expert-like reasoning traces to guide systematic and reliable problem solving. 3) Multi-task Group Relative Policy Optimization that optimizes the model for balanced performance across diverse molecular- and reaction-level tasks. This structured pipeline enables Chem-R to achieve state-of-the-art performance on comprehensive benchmarks, surpassing leading large language models, including Gemini-2.5-Pro and DeepSeek-R1, by up to 32% on molecular tasks and 48% on reaction tasks. Meanwhile, Chem-R also consistently outperforms the existing chemical foundation models across both molecular and reaction level tasks. These results highlight Chem-R's robust generalization, interpretability, and potential as a foundation for next-generation AI-driven chemical discovery. The code and model are available at https://github.com/davidweidawang/Chem-R.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation</title>
      <itunes:episode>1321</itunes:episode>
      <podcast:episode>1321</podcast:episode>
      <itunes:title>MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25535bab-0824-4523-8c27-a0b1be80a030</guid>
      <link>https://share.transistor.fm/s/1591fe8a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18692v1">http://arxiv.org/abs/2510.18692v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18692v1">http://arxiv.org/abs/2510.18692v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:59:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1591fe8a/818f3970.mp3" length="21880254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weinan Jia, Yuning Lu, Mengqi Huang, Hualiang Wang, Binyuan Huang, Nan Chen, Mu Liu, Jidong Jiang, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18692v1">http://arxiv.org/abs/2510.18692v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query-key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy-efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention that uses a lightweight, learnable token router to precisely match tokens without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that end-to-end produces minute-level, multi-shot, 480p videos at 24 fps, with a context length of approximately 580k. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs</title>
      <itunes:episode>1320</itunes:episode>
      <podcast:episode>1320</podcast:episode>
      <itunes:title>Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5ad18c08-a817-4bc3-9427-d0cbd811b644</guid>
      <link>https://share.transistor.fm/s/6c4f1f5a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18876v2">http://arxiv.org/abs/2510.18876v2</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world of complex scenes, which requires fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong capabilities can be easily transferred to videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18876v2">http://arxiv.org/abs/2510.18876v2</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world of complex scenes, which requires fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong capabilities can be easily transferred to videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:58:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c4f1f5a/3d4e1981.mp3" length="22674810" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1413</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18876v2">http://arxiv.org/abs/2510.18876v2</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle to capture the dense world of complex scenes, which requires fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating that its strong capabilities can be easily transferred to videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model</title>
      <itunes:episode>1319</itunes:episode>
      <podcast:episode>1319</podcast:episode>
      <itunes:title>Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5fd2ec43-8287-4181-b74d-088ce811db5e</guid>
      <link>https://share.transistor.fm/s/64bb4b50</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen</p>

            <p><strong>Title:</strong><br>
            Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18855v1">http://arxiv.org/abs/2510.18855v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen</p>

            <p><strong>Title:</strong><br>
            Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18855v1">http://arxiv.org/abs/2510.18855v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:58:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/64bb4b50/99a79408.mp3" length="21652481" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen</p>

            <p><strong>Title:</strong><br>
            Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18855v1">http://arxiv.org/abs/2510.18855v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ring-1T, the first open-source, state-of-the-art thinking model with trillion-scale parameters. It features 1 trillion total parameters and activates approximately 50 billion per token. Training such models at a trillion-parameter scale introduces unprecedented challenges, including train-inference misalignment, inefficiencies in rollout processing, and bottlenecks in the RL system. To address these, we pioneer three interconnected innovations: (1) IcePop stabilizes RL training via token-level discrepancy masking and clipping, resolving instability from training-inference mismatches; (2) C3PO++ improves resource utilization for long rollouts under a token budget by dynamically partitioning them, thereby obtaining high time efficiency; and (3) ASystem, a high-performance RL framework designed to overcome the systemic bottlenecks that impede trillion-parameter model training. Ring-1T delivers breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-v1. Notably, it attains a silver medal-level result on the IMO-2025, underscoring its exceptional reasoning capabilities. By releasing the complete 1T-parameter MoE model to the community, we provide the research community with direct access to cutting-edge reasoning capabilities. This contribution marks a significant milestone in democratizing large-scale reasoning intelligence and establishes a new baseline for open-source model performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IF-VidCap: Can Video Caption Models Follow Instructions?</title>
      <itunes:episode>1318</itunes:episode>
      <podcast:episode>1318</podcast:episode>
      <itunes:title>IF-VidCap: Can Video Caption Models Follow Instructions?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c96c5016-0e9a-4d57-be5e-91b537e42a85</guid>
      <link>https://share.transistor.fm/s/7d61221a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            IF-VidCap: Can Video Caption Models Follow Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18726v1">http://arxiv.org/abs/2510.18726v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            IF-VidCap: Can Video Caption Models Follow Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18726v1">http://arxiv.org/abs/2510.18726v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Oct 2025 20:58:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d61221a/a24b65c5.mp3" length="23516551" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1466</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            IF-VidCap: Can Video Caption Models Follow Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.18726v1">http://arxiv.org/abs/2510.18726v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepAnalyze: Agentic Large Language Models for Autonomous Data Science</title>
      <itunes:episode>1317</itunes:episode>
      <podcast:episode>1317</podcast:episode>
      <itunes:title>DeepAnalyze: Agentic Large Language Models for Autonomous Data Science</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">378ac3dd-db02-4905-a0da-a94d7f5e6221</guid>
      <link>https://share.transistor.fm/s/549c8f9f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du</p>

            <p><strong>Title:</strong><br>
            DeepAnalyze: Agentic Large Language Models for Autonomous Data Science</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16872v1">http://arxiv.org/abs/2510.16872v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du</p>

            <p><strong>Title:</strong><br>
            DeepAnalyze: Agentic Large Language Models for Autonomous Data Science</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16872v1">http://arxiv.org/abs/2510.16872v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:49:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/549c8f9f/a1d33de7.mp3" length="19439791" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1211</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du</p>

            <p><strong>Title:</strong><br>
            DeepAnalyze: Agentic Large Language Models for Autonomous Data Science</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16872v1">http://arxiv.org/abs/2510.16872v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PICABench: How Far Are We from Physically Realistic Image Editing?</title>
      <itunes:episode>1316</itunes:episode>
      <podcast:episode>1316</podcast:episode>
      <itunes:title>PICABench: How Far Are We from Physically Realistic Image Editing?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">359c189c-c5f6-4c4c-9979-d65a454e2ae3</guid>
      <link>https://share.transistor.fm/s/c188fdce</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            PICABench: How Far Are We from Physically Realistic Image Editing?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17681v2">http://arxiv.org/abs/2510.17681v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with substantial room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            PICABench: How Far Are We from Physically Realistic Image Editing?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17681v2">http://arxiv.org/abs/2510.17681v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to generation realism. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, at this moment, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with substantial room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work moving from naive content editing toward physically consistent realism.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:49:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c188fdce/70606d7d.mp3" length="21485279" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1339</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuandong Pu, Le Zhuo, Songhao Han, Jinbo Xing, Kaiwen Zhu, Shuo Cao, Bin Fu, Si Liu, Hongsheng Li, Yu Qiao, Wenlong Zhang, Xi Chen, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            PICABench: How Far Are We from Physically Realistic Image Editing?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17681v2">http://arxiv.org/abs/2510.17681v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has achieved remarkable progress recently. Modern editing models can already follow complex instructions to manipulate the original content. However, beyond completing the editing instructions, the accompanying physical effects are key to the realism of the generation. For example, removing an object should also remove its shadow, reflections, and interactions with nearby objects. Unfortunately, existing models and benchmarks mainly focus on instruction completion but overlook these physical effects. So, how far are we from physically realistic image editing? To answer this, we introduce PICABench, which systematically evaluates physical realism across eight sub-dimensions (spanning optics, mechanics, and state transitions) for most of the common editing operations (add, remove, attribute change, etc.). We further propose PICAEval, a reliable evaluation protocol that uses a VLM-as-a-judge with per-case, region-level human annotations and questions. Beyond benchmarking, we also explore effective solutions by learning physics from videos and construct a training dataset, PICA-100K. After evaluating most of the mainstream models, we observe that physical realism remains a challenging problem with ample room to explore. We hope that our benchmark and proposed solutions can serve as a foundation for future work, moving from naive content editing toward physically consistent realism.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Glyph: Scaling Context Windows via Visual-Text Compression</title>
      <itunes:episode>1315</itunes:episode>
      <podcast:episode>1315</podcast:episode>
      <itunes:title>Glyph: Scaling Context Windows via Visual-Text Compression</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55b320d3-829b-40f8-bf8b-6f6977eb1ff2</guid>
      <link>https://share.transistor.fm/s/a9feec8e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            Glyph: Scaling Context Windows via Visual-Text Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17800v2">http://arxiv.org/abs/2510.17800v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.</p>
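
            <p>As a rough illustration of the rendering idea (not the released code), the sketch below uses Pillow to draw long text onto page images and prints a back-of-the-envelope compression estimate; the page size, font, and per-page visual token count are assumptions.</p>

            <pre><code>import textwrap
from PIL import Image, ImageDraw  # assumes Pillow is installed

def render_pages(text: str, width=1024, height=1024, chars_per_line=100, lines_per_page=60):
    """Render plain text into a list of page images that a VLM could read."""
    lines = textwrap.wrap(text, width=chars_per_line)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        img = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines[start:start + lines_per_page]):
            draw.text((20, 20 + i * 16), line, fill="black")  # default bitmap font
        pages.append(img)
    return pages

doc = "long context " * 5000
pages = render_pages(doc)
text_tokens = len(doc) // 4       # rough estimate: ~4 characters per text token
image_tokens = len(pages) * 256   # assumption: ~256 visual tokens per rendered page
print(len(pages), "pages, est. compression:", round(text_tokens / image_tokens, 1), "x")</code></pre>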
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            Glyph: Scaling Context Windows via Visual-Text Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17800v2">http://arxiv.org/abs/2510.17800v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:48:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9feec8e/ac88c205.mp3" length="24409313" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1522</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            Glyph: Scaling Context Windows via Visual-Text Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17800v2">http://arxiv.org/abs/2510.17800v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FineVision: Open Data Is All You Need</title>
      <itunes:episode>1314</itunes:episode>
      <podcast:episode>1314</podcast:episode>
      <itunes:title>FineVision: Open Data Is All You Need</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">257d617b-9ef8-4dba-99f0-eee960fa7f4c</guid>
      <link>https://share.transistor.fm/s/6b17ae51</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti</p>

            <p><strong>Title:</strong><br>
            FineVision: Open Data Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17269v1">http://arxiv.org/abs/2510.17269v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.</p>
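
            <p>A tiny sketch of the hygiene steps mentioned above, exact-match de-duplication and benchmark decontamination via hashing; the field names and normalization are illustrative, not the released curation tools.</p>

            <pre><code>import hashlib

def norm(text: str) -> str:
    return " ".join(text.lower().split())

def fingerprint(text: str) -> str:
    return hashlib.sha256(norm(text).encode()).hexdigest()

def clean(samples, benchmark_texts):
    """Drop exact duplicates and any sample whose normalized text matches a benchmark item."""
    banned = {fingerprint(t) for t in benchmark_texts}
    seen, kept = set(), []
    for s in samples:
        fp = fingerprint(s["question"] + " " + s["answer"])
        if fp in seen or fp in banned:
            continue  # duplicate within the mixture, or benchmark contamination
        seen.add(fp)
        kept.append(s)
    return kept

data = [{"question": "What is shown?", "answer": "A cat"},
        {"question": "What is shown?", "answer": "A cat"}]
print(len(clean(data, benchmark_texts=["what is shown? a cat"])))  # -> 0</code></pre>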
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti</p>

            <p><strong>Title:</strong><br>
            FineVision: Open Data Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17269v1">http://arxiv.org/abs/2510.17269v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:48:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b17ae51/32d08e3c.mp3" length="26007568" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1622</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti</p>

            <p><strong>Title:</strong><br>
            FineVision: Open Data Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17269v1">http://arxiv.org/abs/2510.17269v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model</title>
      <itunes:episode>1313</itunes:episode>
      <podcast:episode>1313</podcast:episode>
      <itunes:title>TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edf5d9b8-c5ca-46e4-916f-f136a7b39e0e</guid>
      <link>https://share.transistor.fm/s/2ec36e15</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen</p>

            <p><strong>Title:</strong><br>
            TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16449v1">http://arxiv.org/abs/2510.16449v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each step in a trajectory and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.</p>
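
            <p>The sketch below illustrates the selection idea with toy numbers: a small linear probe scores each step's hidden state, scores are averaged per trajectory, and the best trajectory is kept. The probe and dimensions are placeholders, not the paper's 0.6B verifier.</p>

            <pre><code>import numpy as np

rng = np.random.default_rng(0)
hidden_dim, n_traj, n_steps = 64, 8, 5

# Stand-in for sampler-LLM hidden states: one vector per reasoning step.
trajectories = rng.normal(size=(n_traj, n_steps, hidden_dim))

# "Verifier": a single linear probe mapping a hidden state to a step score.
w, b = rng.normal(size=hidden_dim), 0.0

def trajectory_score(steps: np.ndarray) -> float:
    step_scores = 1 / (1 + np.exp(-(steps @ w + b)))  # sigmoid score per step
    return float(step_scores.mean())                   # aggregate over the trajectory

best = max(range(n_traj), key=lambda i: trajectory_score(trajectories[i]))
print("selected trajectory:", best)</code></pre>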
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen</p>

            <p><strong>Title:</strong><br>
            TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16449v1">http://arxiv.org/abs/2510.16449v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each step in a trajectory and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:48:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ec36e15/44261bb6.mp3" length="22637218" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1411</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen</p>

            <p><strong>Title:</strong><br>
            TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.16449v1">http://arxiv.org/abs/2510.16449v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM's intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates the quality of each step in a trajectory and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation</title>
      <itunes:episode>1312</itunes:episode>
      <podcast:episode>1312</podcast:episode>
      <itunes:title>Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">87e1e43c-7c63-4d89-8584-7fd16cf6716b</guid>
      <link>https://share.transistor.fm/s/6af68e29</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17354v1">http://arxiv.org/abs/2510.17354v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.</p>
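
            <p>For intuition only, the sketch below shows mixed-modal to mixed-modal retrieval with stand-in encoders: text and images are embedded into one space, averaged into a single query or document vector, and ranked by cosine similarity. The hash-seeded random encoders are stubs, not Nyx.</p>

            <pre><code>import numpy as np

DIM = 32

def embed_text(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=DIM)            # stand-in text encoder

def embed_image(image_id: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    return rng.normal(size=DIM)            # stand-in image encoder

def embed_mixed(text=None, images=()):
    """Embed a mixed-modal item (text and/or images) as one normalized vector."""
    parts = ([embed_text(text)] if text else []) + [embed_image(i) for i in images]
    v = np.mean(parts, axis=0)
    return v / np.linalg.norm(v)

docs = [{"text": "circuit diagram of an amplifier", "images": ["fig1.png"]},
        {"text": "recipe for sourdough bread", "images": []}]
q = embed_mixed(text="how does the amplifier work?", images=["query_photo.png"])
scores = [float(q @ embed_mixed(d["text"], d["images"])) for d in docs]
print("best doc:", int(np.argmax(scores)))</code></pre>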
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17354v1">http://arxiv.org/abs/2510.17354v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:47:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6af68e29/a51fdbfa.mp3" length="23313022" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1453</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.17354v1">http://arxiv.org/abs/2510.17354v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling</title>
      <itunes:episode>1311</itunes:episode>
      <podcast:episode>1311</podcast:episode>
      <itunes:title>When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58f5775a-0f0a-41f1-9a0b-264ef0e675f0</guid>
      <link>https://share.transistor.fm/s/7216bda0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15346v1">http://arxiv.org/abs/2510.15346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.</p>
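
            <p>A toy sketch of the selective-ensembling idea (not the SAFE implementation): two models' next-token distributions are averaged only when they agree on the top token, otherwise the primary model decides alone. Real usage would also check tokenization alignment, which is omitted here.</p>

            <pre><code>def pick_next_token(p_main: dict, p_other: dict) -> str:
    """Ensemble two next-token distributions only at high-consensus positions."""
    top_main = max(p_main, key=p_main.get)
    top_other = max(p_other, key=p_other.get)
    if top_main != top_other:
        return top_main                      # low consensus: skip ensembling here
    vocab = set(p_main) | set(p_other)
    merged = {t: 0.5 * p_main.get(t, 0.0) + 0.5 * p_other.get(t, 0.0) for t in vocab}
    return max(merged, key=merged.get)

# Agreement case: both models favour " Paris", so the averaged distribution is used.
print(pick_next_token({" Paris": 0.6, " Lyon": 0.4}, {" Paris": 0.7, " Nice": 0.3}))
# Disagreement case: fall back to the primary model's top token (" blue").
print(pick_next_token({" blue": 0.5, " red": 0.5}, {" green": 0.9, " blue": 0.1}))</code></pre>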
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15346v1">http://arxiv.org/abs/2510.15346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Oct 2025 20:47:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7216bda0/e533303c.mp3" length="21748610" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15346v1">http://arxiv.org/abs/2510.15346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models' next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning</title>
      <itunes:episode>1310</itunes:episode>
      <podcast:episode>1310</podcast:episode>
      <itunes:title>A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0909f6f9-8a88-4759-91a2-3036d18498bf</guid>
      <link>https://share.transistor.fm/s/d3777c1e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma</p>

            <p><strong>Title:</strong><br>
            A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15444v1">http://arxiv.org/abs/2510.15444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.</p>
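
            <p>The sketch below shows the two ingredients with made-up numbers: low-probability reasoning paths are pruned, and the surviving paths vote with weights given by their sequence probabilities rather than a plain majority count. The keep ratio and log-probabilities are illustrative, not values from the paper.</p>

            <pre><code>import math
from collections import defaultdict

# (final answer, sequence log-probability) for each sampled reasoning path
samples = [("42", -5.0), ("42", -5.5), ("17", -4.8), ("17", -12.0), ("42", -6.1)]

def rpc_select(samples, keep_ratio=0.8):
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]  # prune low-probability paths
    weights = defaultdict(float)
    for answer, logp in kept:
        weights[answer] += math.exp(logp)                   # probability-weighted vote
    return max(weights, key=weights.get)

print(rpc_select(samples))  # "42": three moderately likely paths outweigh one strong "17"</code></pre>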
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma</p>

            <p><strong>Title:</strong><br>
            A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15444v1">http://arxiv.org/abs/2510.15444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:43:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3777c1e/fb8e573c.mp3" length="19365415" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 112 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhi Zhou, Yuhao Tan, Zenan Li, Yuan Yao, Lan-Zhe Guo, Yu-Feng Li, Xiaoxing Ma</p>

            <p><strong>Title:</strong><br>
            A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15444v1">http://arxiv.org/abs/2510.15444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM</title>
      <itunes:episode>1309</itunes:episode>
      <podcast:episode>1309</podcast:episode>
      <itunes:title>OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa9bd32a-38f0-4367-8df4-2cfd7fac32fa</guid>
      <link>https://share.transistor.fm/s/2fcf03dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15870v1">http://arxiv.org/abs/2510.15870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15870v1">http://arxiv.org/abs/2510.15870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:42:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2fcf03dd/6b717dc6.mp3" length="24143508" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1505</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Jason Lu, Oluwatobi Olabiyi, Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15870v1">http://arxiv.org/abs/2510.15870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks</title>
      <itunes:episode>1308</itunes:episode>
      <podcast:episode>1308</podcast:episode>
      <itunes:title>NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">955d2183-ce62-43d7-81cd-348bda1b3241</guid>
      <link>https://share.transistor.fm/s/a7eb9c05</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Shenghao Xie, Ruowen Zhao, Zhengyi Wang, Hongyu Yan, Wenqiang Zu, Lei Ma, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15019v1">http://arxiv.org/abs/2510.15019v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: https://jamesyjl.github.io/Nano3D</p>
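
            <p>A minimal sketch of region-aware merging on a toy voxel grid (not the Voxel/Slat-Merge code): voxels inside the edited region are taken from the edited generation, while everything outside is copied back from the original so unedited geometry is preserved exactly.</p>

            <pre><code>import numpy as np

original = np.zeros((32, 32, 32), dtype=np.float32)  # stand-in original voxel grid
edited = np.ones_like(original)                       # stand-in edited generation

# Edited region: a small box where the instruction actually changed geometry.
edit_mask = np.zeros_like(original, dtype=bool)
edit_mask[10:20, 10:20, 10:20] = True

# Merge: edited values inside the region, original values everywhere else.
merged = np.where(edit_mask, edited, original)
print("changed voxels:", int((merged != original).sum()))  # only the 10x10x10 box</code></pre>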
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Shenghao Xie, Ruowen Zhao, Zhengyi Wang, Hongyu Yan, Wenqiang Zu, Lei Ma, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15019v1">http://arxiv.org/abs/2510.15019v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: https://jamesyjl.github.io/Nano3D</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:42:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7eb9c05/ef571051.mp3" length="21898646" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Shenghao Xie, Ruowen Zhao, Zhengyi Wang, Hongyu Yan, Wenqiang Zu, Lei Ma, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15019v1">http://arxiv.org/abs/2510.15019v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: https://jamesyjl.github.io/Nano3D</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs</title>
      <itunes:episode>1307</itunes:episode>
      <podcast:episode>1307</podcast:episode>
      <itunes:title>Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9f1a0a6-6b20-4959-8678-055b868e99d6</guid>
      <link>https://share.transistor.fm/s/18546555</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov</p>

            <p><strong>Title:</strong><br>
            Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11288v1">http://arxiv.org/abs/2510.11288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov</p>

            <p><strong>Title:</strong><br>
            Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11288v1">http://arxiv.org/abs/2510.11288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:42:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18546555/ddd41b05.mp3" length="19704398" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1228</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov</p>

            <p><strong>Title:</strong><br>
            Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11288v1">http://arxiv.org/abs/2510.11288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset</title>
      <itunes:episode>1306</itunes:episode>
      <podcast:episode>1306</podcast:episode>
      <itunes:title>Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c28dc00-9639-4eb9-8e99-105b849696ae</guid>
      <link>https://share.transistor.fm/s/508ee054</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15742v1">http://arxiv.org/abs/2510.15742v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15742v1">http://arxiv.org/abs/2510.15742v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:41:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/508ee054/21146570.mp3" length="19378776" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15742v1">http://arxiv.org/abs/2510.15742v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery</title>
      <itunes:episode>1305</itunes:episode>
      <podcast:episode>1305</podcast:episode>
      <itunes:title>Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc8a9695-b94c-4472-a47e-2227483554d3</guid>
      <link>https://share.transistor.fm/s/4c6b0007</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15869v1">http://arxiv.org/abs/2510.15869v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose <strong>Skyfall-GS</strong>, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15869v1">http://arxiv.org/abs/2510.15869v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose <strong>Skyfall-GS</strong>, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:41:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4c6b0007/366d37c2.mp3" length="24502115" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1528</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15869v1">http://arxiv.org/abs/2510.15869v1</a></p>

            <p><strong>Abstract:</strong><br>
            Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing the readily available satellite imagery that supplies realistic coarse geometry and the open-domain diffusion model for creating high-quality close-up appearances. We propose <strong>Skyfall-GS</strong>, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: https://skyfall-gs.jayinnn.dev/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Latent Diffusion Model without Variational Autoencoder</title>
      <itunes:episode>1304</itunes:episode>
      <podcast:episode>1304</podcast:episode>
      <itunes:title>Latent Diffusion Model without Variational Autoencoder</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5faa159-67e4-477a-8701-9776e0a85cc9</guid>
      <link>https://share.transistor.fm/s/9bce9bc4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            Latent Diffusion Model without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15301v2">http://arxiv.org/abs/2510.15301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            Latent Diffusion Model without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15301v2">http://arxiv.org/abs/2510.15301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 20 Oct 2025 20:40:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bce9bc4/3e06d08f.mp3" length="24168565" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1507</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, Jiwen Lu</p>

            <p><strong>Title:</strong><br>
            Latent Diffusion Model without Variational Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.15301v2">http://arxiv.org/abs/2510.15301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA</title>
      <itunes:episode>1303</itunes:episode>
      <podcast:episode>1303</podcast:episode>
      <itunes:title>When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8da92459-1561-4ac2-bb3b-602dca367e22</guid>
      <link>https://share.transistor.fm/s/1b3720c0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova</p>

            <p><strong>Title:</strong><br>
            When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04849v1">http://arxiv.org/abs/2510.04849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova</p>

            <p><strong>Title:</strong><br>
            When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04849v1">http://arxiv.org/abs/2510.04849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:13:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b3720c0/91199bb1.mp3" length="23000402" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1434</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova</p>

            <p><strong>Title:</strong><br>
            When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04849v1">http://arxiv.org/abs/2510.04849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agentic Entropy-Balanced Policy Optimization</title>
      <itunes:episode>1302</itunes:episode>
      <podcast:episode>1302</podcast:episode>
      <itunes:title>Agentic Entropy-Balanced Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed2aa3da-728d-40f6-bc48-3138157fc234</guid>
      <link>https://share.transistor.fm/s/9464459a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.LG, cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Entropy-Balanced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14545v1">http://arxiv.org/abs/2510.14545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.LG, cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Entropy-Balanced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14545v1">http://arxiv.org/abs/2510.14545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:12:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9464459a/8c1b513c.mp3" length="22971520" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1432</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.LG, cs.AI, cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Entropy-Balanced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14545v1">http://arxiv.org/abs/2510.14545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose the Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching issues; and (2) Entropy-Balanced Policy Optimization that inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WithAnyone: Towards Controllable and ID Consistent Image Generation</title>
      <itunes:episode>1301</itunes:episode>
      <podcast:episode>1301</podcast:episode>
      <itunes:title>WithAnyone: Towards Controllable and ID Consistent Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">51a3ed6b-d831-4c72-ad77-f0382a401bbf</guid>
      <link>https://share.transistor.fm/s/9e6e0365</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            WithAnyone: Towards Controllable and ID Consistent Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14975v1">http://arxiv.org/abs/2510.14975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            WithAnyone: Towards Controllable and ID Consistent Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14975v1">http://arxiv.org/abs/2510.14975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:12:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9e6e0365/9ffb54ec.mp3" length="22381802" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            WithAnyone: Towards Controllable and ID Consistent Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14975v1">http://arxiv.org/abs/2510.14975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AI for Service: Proactive Assistance with AI Glasses</title>
      <itunes:episode>1300</itunes:episode>
      <podcast:episode>1300</podcast:episode>
      <itunes:title>AI for Service: Proactive Assistance with AI Glasses</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ac0cb1b3-ceb8-4b86-9b70-935b418a3829</guid>
      <link>https://share.transistor.fm/s/51907bf0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            AI for Service: Proactive Assistance with AI Glasses</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14359v1">http://arxiv.org/abs/2510.14359v1</a></p>

            <p><strong>Abstract:</strong><br>
            In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            AI for Service: Proactive Assistance with AI Glasses</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14359v1">http://arxiv.org/abs/2510.14359v1</a></p>

            <p><strong>Abstract:</strong><br>
            In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:12:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/51907bf0/b9b782b6.mp3" length="23157938" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1444</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Yiyu Wang, Chenfei Liao, Boxue Yang, Junxian Li, Weifeng Liu, Haocong He, Bolong Feng, Xuyang Liu, Yuanhuiyi Lyu, Xu Zheng, Xuming Hu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            AI for Service: Proactive Assistance with AI Glasses</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14359v1">http://arxiv.org/abs/2510.14359v1</a></p>

            <p><strong>Abstract:</strong><br>
            In an era where AI is evolving from a passive tool into an active and adaptive companion, we introduce AI for Service (AI4Service), a new paradigm that enables proactive and real-time assistance in daily life. Existing AI services remain largely reactive, responding only to explicit user commands. We argue that a truly intelligent and helpful assistant should be capable of anticipating user needs and taking actions proactively when appropriate. To realize this vision, we propose Alpha-Service, a unified framework that addresses two fundamental challenges: Know When to intervene by detecting service opportunities from egocentric video streams, and Know How to provide both generalized and personalized services. Inspired by the von Neumann computer architecture and based on AI glasses, Alpha-Service consists of five key components: an Input Unit for perception, a Central Processing Unit for task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit for long-term personalization, and an Output Unit for natural human interaction. As an initial exploration, we implement Alpha-Service through a multi-agent system deployed on AI glasses. Case studies, including a real-time Blackjack advisor, a museum tour guide, and a shopping fit assistant, demonstrate its ability to seamlessly perceive the environment, infer user intent, and provide timely and useful assistance without explicit prompts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Pixels to Words -- Towards Native Vision-Language Primitives at Scale</title>
      <itunes:episode>1299</itunes:episode>
      <podcast:episode>1299</podcast:episode>
      <itunes:title>From Pixels to Words -- Towards Native Vision-Language Primitives at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">80bb7f2c-435c-4cd0-8c6c-a30bb11b66f7</guid>
      <link>https://share.transistor.fm/s/5f4b269d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            From Pixels to Words -- Towards Native Vision-Language Primitives at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14979v1">http://arxiv.org/abs/2510.14979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            From Pixels to Words -- Towards Native Vision-Language Primitives at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14979v1">http://arxiv.org/abs/2510.14979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.</p>
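
            <p><strong>Code sketch:</strong><br>
            Principle (i), aligning pixel and word representations within a shared semantic space, is commonly realized with projection heads into a joint embedding space plus a symmetric contrastive objective. The PyTorch-style sketch below shows that generic recipe under those assumptions; the paper's actual primitive, dimensions, and loss are not specified here.</p>

            <pre><code># Generic pixel-word alignment sketch (assumed recipe, not NEO's actual primitive).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceAligner(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=768, shared_dim=512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, patch_feats, token_feats):
        # Pool per-image / per-caption features, then L2-normalize in the shared space.
        v = F.normalize(self.vision_proj(patch_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(token_feats.mean(dim=1)), dim=-1)
        logits = v @ t.t() / 0.07                  # pairwise similarities, fixed temperature
        targets = torch.arange(v.size(0))          # matched image-text pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

aligner = SharedSpaceAligner()
loss = aligner(torch.randn(4, 196, 1024), torch.randn(4, 32, 768))
print(loss.item())
</code></pre>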
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:11:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f4b269d/19cc1c30.mp3" length="19909163" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1241</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            From Pixels to Words -- Towards Native Vision-Language Primitives at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14979v1">http://arxiv.org/abs/2510.14979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints</title>
      <itunes:episode>1298</itunes:episode>
      <podcast:episode>1298</podcast:episode>
      <itunes:title>ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05d91d8b-e9db-440e-ab95-8684af44cd13</guid>
      <link>https://share.transistor.fm/s/53fdc73a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14847v1">http://arxiv.org/abs/2510.14847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14847v1">http://arxiv.org/abs/2510.14847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.</p>
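
            <p><strong>Code sketch:</strong><br>
            The core idea, widening the test-time search and re-weighting the reward when the prompt pairs semantically distant concepts, can be sketched in a few lines. Everything below is a placeholder: the semantic-distance proxy, the candidate generator, and the reward are stand-ins, not the paper's components.</p>

            <pre><code># Hypothetical sketch of prompt-adaptive test-time search (all components are stand-ins).
import random

def semantic_distance(prompt: str) -> float:
    """Crude proxy: rarer concept pairs get a larger distance in [0, 1]."""
    rare_words = {"dragon", "origami", "underwater", "violin"}
    hits = sum(w in rare_words for w in prompt.lower().split())
    return min(1.0, hits / 2)

def generate_candidate(prompt: str, seed: int) -> str:
    random.seed(seed)
    return f"video(prompt={prompt!r}, seed={seed})"

def reward(candidate: str, w_fidelity: float) -> float:
    # Placeholder reward mixing a quality term and a prompt-fidelity term.
    quality = random.random()
    fidelity = random.random()
    return (1.0 - w_fidelity) * quality + w_fidelity * fidelity

def imagery_search(prompt: str) -> str:
    d = semantic_distance(prompt)
    num_candidates = 4 + int(12 * d)   # widen the search for imaginative prompts
    w_fidelity = 0.3 + 0.5 * d         # weight prompt fidelity more heavily too
    scored = [(reward(generate_candidate(prompt, s), w_fidelity), s)
              for s in range(num_candidates)]
    best_seed = max(scored)[1]
    return generate_candidate(prompt, best_seed)

print(imagery_search("an origami dragon playing violin underwater"))
</code></pre>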
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:11:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/53fdc73a/faa336ce.mp3" length="20414919" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang</p>

            <p><strong>Title:</strong><br>
            ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14847v1">http://arxiv.org/abs/2510.14847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents</title>
      <itunes:episode>1297</itunes:episode>
      <podcast:episode>1297</podcast:episode>
      <itunes:title>Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">528e5ed0-0d81-4811-aeef-ca8e6cf55ac2</guid>
      <link>https://share.transistor.fm/s/09133f47</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying</p>

            <p><strong>Title:</strong><br>
            Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14967v1">http://arxiv.org/abs/2510.14967v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying</p>

            <p><strong>Title:</strong><br>
            Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14967v1">http://arxiv.org/abs/2510.14967v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.</p>
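
            <p><strong>Code sketch:</strong><br>
            The turn-level reward is simply the marginal increase in the policy's probability of the ground-truth answer from one turn to the next, combined with the outcome reward. The sketch below follows that definition from the abstract; how the answer probability is estimated and how the outcome term is weighted are assumptions here.</p>

            <pre><code># Sketch of IGPO-style turn-level rewards from the information-gain definition.
def information_gain_rewards(answer_probs, outcome_reward, outcome_weight=1.0):
    """answer_probs: p_0, p_1, ..., p_T, where p_t is the policy's probability of the
    correct answer after turn t (p_0 is the prior belief before any turn)."""
    turn_rewards = [answer_probs[t] - answer_probs[t - 1]
                    for t in range(1, len(answer_probs))]
    # Outcome-level supervision is added on the final turn so the trajectory stays
    # grounded in whether the answer was actually correct (weighting is an assumption).
    turn_rewards[-1] += outcome_weight * outcome_reward
    return turn_rewards

# Example: the agent's belief in the correct answer rises over three search turns.
probs = [0.05, 0.20, 0.35, 0.80]
print(information_gain_rewards(probs, outcome_reward=1.0))
# approximately [0.15, 0.15, 1.45]: a dense per-turn signal instead of one sparse reward
</code></pre>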
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:10:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/09133f47/02a196f9.mp3" length="24127652" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying</p>

            <p><strong>Title:</strong><br>
            Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14967v1">http://arxiv.org/abs/2510.14967v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LaSeR: Reinforcement Learning with Last-Token Self-Rewarding</title>
      <itunes:episode>1296</itunes:episode>
      <podcast:episode>1296</podcast:episode>
      <itunes:title>LaSeR: Reinforcement Learning with Last-Token Self-Rewarding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">402fa405-109c-4c09-9ba7-5acceee556d5</guid>
      <link>https://share.transistor.fm/s/0788b068</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin</p>

            <p><strong>Title:</strong><br>
            LaSeR: Reinforcement Learning with Last-Token Self-Rewarding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14943v1">http://arxiv.org/abs/2510.14943v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin</p>

            <p><strong>Title:</strong><br>
            LaSeR: Reinforcement Learning with Last-Token Self-Rewarding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14943v1">http://arxiv.org/abs/2510.14943v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.</p>
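
            <p><strong>Code sketch:</strong><br>
            The abstract gives the self-rewarding score as the KL-coefficient-scaled gap between the log-probability of a pre-specified token at the solution's last position and a pre-calculated constant, tied to the verifier reward by an MSE term. The sketch below follows that description; the token id, constant, and coefficient values are illustrative assumptions.</p>

            <pre><code># Sketch of a LaSeR-style auxiliary loss; hyperparameter values are placeholders.
import torch
import torch.nn.functional as F

def laser_self_reward(last_token_logits, special_token_id, constant, kl_coef):
    # Log-probability of the pre-specified token at each solution's final position.
    logprob = F.log_softmax(last_token_logits, dim=-1)[..., special_token_id]
    return kl_coef * (logprob - constant)

def laser_aux_loss(last_token_logits, verifier_reward, special_token_id=0,
                   constant=-2.0, kl_coef=0.1):
    score = laser_self_reward(last_token_logits, special_token_id, constant, kl_coef)
    return F.mse_loss(score, verifier_reward)   # added on top of the usual RLVR loss

logits = torch.randn(8, 32000)               # next-token logits at each solution's last token
rewards = torch.randint(0, 2, (8,)).float()  # 0/1 verifier-based reasoning rewards
print(laser_aux_loss(logits, rewards).item())
</code></pre>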
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:10:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0788b068/4bb25c36.mp3" length="19005939" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1184</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin</p>

            <p><strong>Title:</strong><br>
            LaSeR: Reinforcement Learning with Last-Token Self-Rewarding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14943v1">http://arxiv.org/abs/2510.14943v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar</title>
      <itunes:episode>1295</itunes:episode>
      <podcast:episode>1295</podcast:episode>
      <itunes:title>TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c5238776-4dda-4f04-a738-4dedfda04fa4</guid>
      <link>https://share.transistor.fm/s/69cdce68</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG, cs.PL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yinxi Li, Yuntian Deng, Pengyu Nie</p>

            <p><strong>Title:</strong><br>
            TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14972v1">http://arxiv.org/abs/2510.14972v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG, cs.PL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yinxi Li, Yuntian Deng, Pengyu Nie</p>

            <p><strong>Title:</strong><br>
            TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14972v1">http://arxiv.org/abs/2510.14972v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.</p>
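
            <p><strong>Code sketch:</strong><br>
            The premise is that semantically identical code can be split into different subword tokens depending on superficial formatting. The toy example below makes that concrete with a tiny hand-written vocabulary and a greedy longest-match segmenter; it is an illustration only, not TokDrift's rewrite rules or a real BPE tokenizer.</p>

            <pre><code># Toy illustration of tokenization drift on semantically identical snippets.
VOCAB = {"def", "de", "f", " f", "oo", "foo", "(", ")", ":", " ", "return", " return",
         "x", "+", "1", "my", "_", "var", "myvar", "my_var"}

def segment(text):
    """Greedy longest-match segmentation over the toy vocabulary."""
    tokens, i = [], 0
    while i != len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character falls back to itself
            i += 1
    return tokens

# Same program, different surface form: an extra space and an identifier rename
# change the token sequence even though the code's meaning is unchanged.
print(segment("def foo(x):return x+1"))
print(segment("def foo(x): return x+1"))
print(segment("def foo(my_var):return my_var+1"))
</code></pre>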
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:10:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69cdce68/b80a85d7.mp3" length="20026181" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG, cs.PL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yinxi Li, Yuntian Deng, Pengyu Nie</p>

            <p><strong>Title:</strong><br>
            TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.14972v1">http://arxiv.org/abs/2510.14972v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BitNet Distillation</title>
      <itunes:episode>1294</itunes:episode>
      <podcast:episode>1294</podcast:episode>
      <itunes:title>BitNet Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eddb8f1c-d837-4ed8-bfae-0a19336c96a7</guid>
      <link>https://share.transistor.fm/s/ae95c5e0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.13998v1">http://arxiv.org/abs/2510.13998v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.13998v1">http://arxiv.org/abs/2510.13998v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.</p>
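
            <p><strong>Code sketch:</strong><br>
            Ternary (1.58-bit) weights are typically obtained in the BitNet line of work by scaling a weight matrix by its mean absolute value and rounding to the set {-1, 0, 1}. The sketch below shows that standard absmean quantizer; whether BitDistill uses exactly this quantizer is an assumption, as the abstract does not specify it.</p>

            <pre><code># Sketch of BitNet-style ternary quantization (assumed quantizer, for illustration).
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
    q = (w / scale).round().clamp(-1, 1)           # ternary weights in {-1, 0, 1}
    return q, scale

def ternary_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    q, scale = ternary_quantize(w)
    return (x @ q.t()) * scale                     # dequantize by rescaling the output

w = torch.randn(16, 64)                            # a full-precision weight matrix
x = torch.randn(2, 64)
print(ternary_quantize(w)[0].unique())             # tensor([-1., 0., 1.])
print(ternary_linear(x, w).shape)                  # torch.Size([2, 16])
</code></pre>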
            ]]>
      </content:encoded>
      <pubDate>Fri, 17 Oct 2025 21:09:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ae95c5e0/2b02470b.mp3" length="21460990" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.13998v1">http://arxiv.org/abs/2510.13998v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue of the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks. Experimental results show that BitDistill achieves performance comparable to the full-precision counterpart models across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model</title>
      <itunes:episode>1293</itunes:episode>
      <podcast:episode>1293</podcast:episode>
      <itunes:title>Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6891c895-47d5-4671-9ffe-bd84f7c00df9</guid>
      <link>https://share.transistor.fm/s/7ea65c51</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 133 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li</p>

            <p><strong>Title:</strong><br>
            Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12276v1">http://arxiv.org/abs/2510.12276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness and hinder their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators. We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision. Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8x and improves data efficiency across diverse robotic tasks. Project page is at https://spatial-forcing.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 133 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li</p>

            <p><strong>Title:</strong><br>
            Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12276v1">http://arxiv.org/abs/2510.12276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness and hinder their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators. We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision. Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8x and improves data efficiency across diverse robotic tasks. Project page is at https://spatial-forcing.github.io/</p>
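
            <p><strong>Code sketch:</strong><br>
            The central mechanism is an alignment between intermediate VLA visual embeddings and frozen features from a pretrained 3D foundation model. The sketch below uses a linear projection plus a cosine objective as one plausible realization; the paper's exact alignment layer and loss are not specified in the abstract, so these choices are assumptions.</p>

            <pre><code># Sketch of an implicit spatial-alignment loss in the spirit of Spatial Forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlignmentHead(nn.Module):
    def __init__(self, vla_dim=1024, geo_dim=768):
        super().__init__()
        self.proj = nn.Linear(vla_dim, geo_dim)

    def forward(self, vla_tokens, geo_tokens):
        # vla_tokens: (B, N, vla_dim) intermediate visual embeddings from the VLA
        # geo_tokens: (B, N, geo_dim) frozen features from a 3D foundation model
        pred = F.normalize(self.proj(vla_tokens), dim=-1)
        target = F.normalize(geo_tokens.detach(), dim=-1)   # no gradient to the 3D model
        return 1.0 - (pred * target).sum(dim=-1).mean()     # mean cosine distance

head = SpatialAlignmentHead()
loss = head(torch.randn(2, 196, 1024), torch.randn(2, 196, 768))
print(loss.item())
</code></pre>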
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:08:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ea65c51/93742252.mp3" length="21689268" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 133 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, Haoang Li</p>

            <p><strong>Title:</strong><br>
            Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12276v1">http://arxiv.org/abs/2510.12276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language-action (VLA) models have recently shown strong potential in enabling robots to follow language instructions and execute precise actions. However, most VLAs are built upon vision-language models pretrained solely on 2D data, which lack accurate spatial awareness and hinder their ability to operate in the 3D physical world. Existing solutions attempt to incorporate explicit 3D sensor inputs such as depth maps or point clouds, but these approaches face challenges due to sensor noise, hardware heterogeneity, and incomplete depth coverage in existing datasets. Alternative methods that estimate 3D cues from 2D images also suffer from the limited performance of depth estimators. We propose Spatial Forcing (SF), a simple yet effective alignment strategy that implicitly forces VLA models to develop spatial comprehension capabilities without relying on explicit 3D inputs or depth estimators. SF aligns intermediate visual embeddings of VLAs with geometric representations produced by pretrained 3D foundation models. By enforcing alignment at intermediate layers, SF guides VLAs to encode richer spatial representations that enhance action precision. Extensive experiments in simulation and real-world environments demonstrate that SF achieves state-of-the-art results, surpassing both 2D- and 3D-based VLAs. SF further accelerates training by up to 3.8x and improves data efficiency across diverse robotic tasks. Project page is at https://spatial-forcing.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training</title>
      <itunes:episode>1292</itunes:episode>
      <podcast:episode>1292</podcast:episode>
      <itunes:title>Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">661b23e0-6f24-4880-a766-7bf1f68491b9</guid>
      <link>https://share.transistor.fm/s/9592d676</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12586v1">http://arxiv.org/abs/2510.12586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12586v1">http://arxiv.org/abs/2510.12586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.</p>
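
            <p><strong>Code sketch:</strong><br>
            The two-stage recipe can be outlined in a few lines: stage 1 pre-trains an encoder so that a clean image and a point on its deterministic prior-to-data trajectory map to nearby embeddings; stage 2 attaches a randomly initialized decoder and trains end to end. The linear trajectory interpolation, toy encoder/decoder, and stage-2 reconstruction loss below are simplified stand-ins, not the paper's actual architecture or objectives.</p>

            <pre><code># Simplified sketch of the two-stage training idea (all modules are stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
decoder = nn.Linear(256, 3 * 32 * 32)

def trajectory_point(x, noise, t):
    """Stand-in for a point at time t on the deterministic prior-to-data trajectory."""
    return (1.0 - t) * noise + t * x

def stage1_loss(x):
    t = torch.rand(x.size(0), 1, 1, 1)
    x_t = trajectory_point(x, torch.randn_like(x), t)
    z_clean = F.normalize(encoder(x), dim=-1)
    z_traj = F.normalize(encoder(x_t), dim=-1)
    return 1.0 - (z_clean * z_traj).sum(dim=-1).mean()    # align the two embeddings

def stage2_loss(x):
    recon = decoder(encoder(x)).view_as(x)                # randomly initialized decoder
    return F.mse_loss(recon, x)                           # stand-in for the generative loss

x = torch.randn(4, 3, 32, 32)
print(stage1_loss(x).item(), stage2_loss(x).item())
</code></pre>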
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:08:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9592d676/27536476.mp3" length="22617967" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1410</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12586v1">http://arxiv.org/abs/2510.12586v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation</title>
      <itunes:episode>1291</itunes:episode>
      <podcast:episode>1291</podcast:episode>
      <itunes:title>DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">401e249d-56f9-43df-ac7a-8d4cff53608d</guid>
      <link>https://share.transistor.fm/s/7626c9f5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09116v2">http://arxiv.org/abs/2510.09116v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.</p>
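
            <p><strong>Code sketch:</strong><br>
            A hypothetical data-structure sketch mirroring the six evaluation dimensions named above and the kind of error-label plus scalar-score annotation MetricAlign is described as providing; the field names and example strings are illustrative, not the released format.</p>

            <pre><code># Illustrative containers for a DITING-style annotated sentence pair.
from dataclasses import dataclass, field

DIMENSIONS = [
    "idiom_translation", "lexical_ambiguity", "terminology_localization",
    "tense_consistency", "zero_pronoun_resolution", "cultural_safety",
]

@dataclass
class AnnotatedPair:
    source_zh: str
    target_en: str
    scores: dict = field(default_factory=dict)        # dimension name to scalar quality score
    error_labels: list = field(default_factory=list)  # e.g. ["idiom_translation"]

pair = AnnotatedPair(
    source_zh="(Chinese web-novel sentence)",
    target_en="(candidate English translation)",
    scores={d: 0.0 for d in DIMENSIONS},
)
print(sorted(pair.scores))
</code></pre>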
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09116v2">http://arxiv.org/abs/2510.09116v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:08:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7626c9f5/4549030c.mp3" length="21147583" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Eric Dong, Sophia Ananiadou, Min Peng, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09116v2">http://arxiv.org/abs/2510.09116v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Language-Centric Omnimodal Representation Learning</title>
      <itunes:episode>1290</itunes:episode>
      <podcast:episode>1290</podcast:episode>
      <itunes:title>Scaling Language-Centric Omnimodal Representation Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a1498cb2-fa34-40da-9d18-92e51d4b89bb</guid>
      <link>https://share.transistor.fm/s/8f64bffe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Scaling Language-Centric Omnimodal Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11693v1">http://arxiv.org/abs/2510.11693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is emerging as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.</p>
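
            <p><strong>Code sketch:</strong><br>
            A generic InfoNCE-style contrastive loss of the kind the abstract describes as a lightweight refinement stage on top of MLLM embeddings; this is textbook contrastive learning under assumed shapes, not the exact LCO-Emb recipe.</p>

            <pre><code># In-batch contrastive refinement over paired embeddings from two modalities.
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, temperature=0.05):
    # query_emb, pos_emb: (batch, dim) embeddings of matched items
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.t() / temperature                 # off-diagonal entries act as negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(16, 512), torch.randn(16, 512))
print(float(loss))
</code></pre>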
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Scaling Language-Centric Omnimodal Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11693v1">http://arxiv.org/abs/2510.11693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is emerging as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:07:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f64bffe/0b7f005d.mp3" length="26986868" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1683</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong</p>

            <p><strong>Title:</strong><br>
            Scaling Language-Centric Omnimodal Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11693v1">http://arxiv.org/abs/2510.11693v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is emerging as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Codes, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Robot Learning: A Tutorial</title>
      <itunes:episode>1289</itunes:episode>
      <podcast:episode>1289</podcast:episode>
      <itunes:title>Robot Learning: A Tutorial</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">895dc37f-4d0b-4900-813b-4f49c1d4807c</guid>
      <link>https://share.transistor.fm/s/901a8934</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, Michel Aractingi</p>

            <p><strong>Title:</strong><br>
            Robot Learning: A Tutorial</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12403v1">http://arxiv.org/abs/2510.12403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in lerobot.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, Michel Aractingi</p>

            <p><strong>Title:</strong><br>
            Robot Learning: A Tutorial</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12403v1">http://arxiv.org/abs/2510.12403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in lerobot.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:07:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/901a8934/58fce1ce.mp3" length="22793869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, Michel Aractingi</p>

            <p><strong>Title:</strong><br>
            Robot Learning: A Tutorial</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12403v1">http://arxiv.org/abs/2510.12403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in lerobot.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Detect Anything via Next Point Prediction</title>
      <itunes:episode>1288</itunes:episode>
      <podcast:episode>1288</podcast:episode>
      <itunes:title>Detect Anything via Next Point Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3219c97d-18d1-4bf0-8b26-4890471d877a</guid>
      <link>https://share.transistor.fm/s/77213759</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang</p>

            <p><strong>Title:</strong><br>
            Detect Anything via Next Point Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12798v1">http://arxiv.org/abs/2510.12798v1</a></p>

            <p><strong>Abstract:</strong><br>
            Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges such as low recall rates, duplicate predictions, and coordinate misalignment. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.</p>
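
            <p><strong>Code sketch:</strong><br>
            An illustrative take on the coordinate quantization described above: box coordinates are mapped to integer bins from 0 to 999 that a language model can emit as special tokens. The bin-token naming and helper functions are hypothetical.</p>

            <pre><code># Quantize a pixel box into 0-999 bins and render them as (hypothetical) special tokens.
def quantize(coord, image_size, num_bins=1000):
    idx = int(coord / image_size * num_bins)
    return min(max(idx, 0), num_bins - 1)

def box_to_tokens(box, width, height):
    x0, y0, x1, y1 = box
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    return ["[bin_{}]".format(b) for b in bins]

print(box_to_tokens((64, 128, 512, 640), width=1024, height=1024))
# ['[bin_62]', '[bin_125]', '[bin_500]', '[bin_625]']
</code></pre>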
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang</p>

            <p><strong>Title:</strong><br>
            Detect Anything via Next Point Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12798v1">http://arxiv.org/abs/2510.12798v1</a></p>

            <p><strong>Abstract:</strong><br>
            Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges such as low recall rates, duplicate predictions, and coordinate misalignment. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:06:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77213759/784a1cf6.mp3" length="22144376" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang</p>

            <p><strong>Title:</strong><br>
            Detect Anything via Next Point Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12798v1">http://arxiv.org/abs/2510.12798v1</a></p>

            <p><strong>Abstract:</strong><br>
            Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges such as low recall rates, duplicate predictions, and coordinate misalignment. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model's learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million samples with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni's inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Vibe Coding with Large Language Models</title>
      <itunes:episode>1287</itunes:episode>
      <podcast:episode>1287</podcast:episode>
      <itunes:title>A Survey of Vibe Coding with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">111c8633-d47c-4446-bf53-4e49152b8957</guid>
      <link>https://share.transistor.fm/s/2f1e9ad0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng</p>

            <p><strong>Title:</strong><br>
            A Survey of Vibe Coding with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12399v1">http://arxiv.org/abs/2510.12399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from a systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components, including LLMs for coding, LLM-based coding agents, development environments for coding agents, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng</p>

            <p><strong>Title:</strong><br>
            A Survey of Vibe Coding with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12399v1">http://arxiv.org/abs/2510.12399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from a systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components, including LLMs for coding, LLM-based coding agents, development environments for coding agents, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:06:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2f1e9ad0/d0c58fe2.mp3" length="21758608" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng</p>

            <p><strong>Title:</strong><br>
            A Survey of Vibe Coding with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12399v1">http://arxiv.org/abs/2510.12399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed "Vibe Coding" where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from a systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components, including LLMs for coding, LLM-based coding agents, development environments for coding agents, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution</title>
      <itunes:episode>1286</itunes:episode>
      <podcast:episode>1286</podcast:episode>
      <itunes:title>FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a33a723c-4c0c-467d-8d45-d536a3743a7e</guid>
      <link>https://share.transistor.fm/s/37de02da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12747v1">http://arxiv.org/abs/2510.12747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.</p>
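
            <p><strong>Code sketch:</strong><br>
            A hypothetical sketch of a locality-constrained attention mask in the spirit of the abstract: each query position may attend only to keys within a fixed local window. The window size and 1-D layout are illustrative, not FlashVSR's actual sparsity pattern.</p>

            <pre><code># Build a boolean mask allowing attention only within a local window of positions.
import torch

def local_window_mask(seq_len, window):
    idx = torch.arange(seq_len)
    # True where the distance between query and key positions is at most `window`.
    return (idx[:, None] - idx[None, :]).abs().le(window)

print(local_window_mask(seq_len=8, window=2).int())
</code></pre>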
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12747v1">http://arxiv.org/abs/2510.12747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:06:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/37de02da/860a5d6c.mp3" length="22792247" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12747v1">http://arxiv.org/abs/2510.12747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Dr.LLM: Dynamic Layer Routing in LLMs</title>
      <itunes:episode>1285</itunes:episode>
      <podcast:episode>1285</podcast:episode>
      <itunes:title>Dr.LLM: Dynamic Layer Routing in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ea373e8-679e-4881-91eb-cc8e19f58eb3</guid>
      <link>https://share.transistor.fm/s/52f7a0d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh</p>

            <p><strong>Title:</strong><br>
            Dr.LLM: Dynamic Layer Routing in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12773v1">http://arxiv.org/abs/2510.12773v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.</p>
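
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of per-layer routing as the abstract describes it: a tiny bottleneck router per block chooses skip, execute, or repeat from pooled hidden states. Module sizes, pooling, and the toy blocks are illustrative, not the Dr.LLM architecture.</p>

            <pre><code># Each block carries a small router that decides how hidden states pass through it.
import torch
import torch.nn as nn

SKIP, EXECUTE, REPEAT = 0, 1, 2

class RoutedBlock(nn.Module):
    def __init__(self, dim=64, bottleneck=16):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.router = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                                    nn.Linear(bottleneck, 3))

    def forward(self, h):
        pooled = h.mean(dim=1)                        # pool over the sequence dimension
        action = self.router(pooled).argmax(dim=-1)   # per-example routing decision
        out = h.clone()
        for i, a in enumerate(action.tolist()):
            if a == EXECUTE:
                out[i] = h[i] + self.block(h[i])
            elif a == REPEAT:
                once = h[i] + self.block(h[i])
                out[i] = once + self.block(once)
            # SKIP: leave the hidden state unchanged
        return out

stack = nn.ModuleList([RoutedBlock() for _ in range(4)])
h = torch.randn(2, 10, 64)                            # (batch, seq, dim)
for layer in stack:
    h = layer(h)
print(h.shape)
</code></pre>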
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh</p>

            <p><strong>Title:</strong><br>
            Dr.LLM: Dynamic Layer Routing in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12773v1">http://arxiv.org/abs/2510.12773v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:05:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52f7a0d8/0f2b9780.mp3" length="22933896" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh</p>

            <p><strong>Title:</strong><br>
            Dr.LLM: Dynamic Layer Routing in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.12773v1">http://arxiv.org/abs/2510.12773v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design (windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers) ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only a 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models</title>
      <itunes:episode>1284</itunes:episode>
      <podcast:episode>1284</podcast:episode>
      <itunes:title>Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef48ea7e-05b9-4f75-abd7-aea1f9fb65df</guid>
      <link>https://share.transistor.fm/s/720ae0ed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Youngrok Park, Hojung Jung, Sangmin Bae, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11057v1">http://arxiv.org/abs/2510.11057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity. In this paper, we propose a general solution to address the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, identifying that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, 'Temporal Alignment Guidance' (TAG), attracting the samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.</p>
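
            <p><strong>Code sketch:</strong><br>
            A toy, hypothetical reading of the mechanism described above: a time predictor estimates the effective timestep of an intermediate sample, and a small gradient step on the squared gap nudges the sample back toward the expected timestep. The predictor, scale, and update rule are illustrative only.</p>

            <pre><code># One guidance step that pulls samples back toward their expected timestep.
import torch
import torch.nn as nn

time_predictor = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 1))

def temporal_alignment_step(x, t_expected, scale=0.1):
    x = x.detach().requires_grad_(True)
    t_hat = time_predictor(x).squeeze(-1)        # predicted timestep per sample
    gap = ((t_hat - t_expected) ** 2).sum()      # penalize drift from the expected timestep
    grad, = torch.autograd.grad(gap, x)
    return x.detach() - scale * grad             # move samples toward temporal alignment

x = torch.randn(4, 16)                           # stand-ins for intermediate samples
x = temporal_alignment_step(x, torch.full((4,), 0.5))
print(x.shape)
</code></pre>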
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Youngrok Park, Hojung Jung, Sangmin Bae, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11057v1">http://arxiv.org/abs/2510.11057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity. In this paper, we propose a general solution to address the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, identifying that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, 'Temporal Alignment Guidance' (TAG), attracting the samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 15 Oct 2025 21:05:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/720ae0ed/947ee88b.mp3" length="20024515" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Youngrok Park, Hojung Jung, Sangmin Bae, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Temporal Alignment Guidance: On-Manifold Sampling in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11057v1">http://arxiv.org/abs/2510.11057v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have achieved remarkable success as generative models. However, even a well-trained model can accumulate errors throughout the generation process. These errors become particularly problematic when arbitrary guidance is applied to steer samples toward desired properties, which often breaks sample fidelity. In this paper, we propose a general solution to address the off-manifold phenomenon observed in diffusion models. Our approach leverages a time predictor to estimate deviations from the desired data manifold at each timestep, identifying that a larger time gap is associated with reduced generation quality. We then design a novel guidance mechanism, 'Temporal Alignment Guidance' (TAG), attracting the samples back to the desired manifold at every timestep during generation. Through extensive experiments, we demonstrate that TAG consistently produces samples closely aligned with the desired manifold at each timestep, leading to significant improvements in generation quality across various downstream tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs</title>
      <itunes:episode>1283</itunes:episode>
      <podcast:episode>1283</podcast:episode>
      <itunes:title>QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0cded76f-3d4f-41ca-a923-dceb125490bd</guid>
      <link>https://share.transistor.fm/s/6fc15ae1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11696v1">http://arxiv.org/abs/2510.11696v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a speedup of over 1.5x in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) for the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11696v1">http://arxiv.org/abs/2510.11696v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a speedup of over 1.5x in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) for the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:08:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6fc15ae1/59d93077.mp3" length="23373215" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1457</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11696v1">http://arxiv.org/abs/2510.11696v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a speedup of over 1.5x in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) for the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion Transformers with Representation Autoencoders</title>
      <itunes:episode>1282</itunes:episode>
      <podcast:episode>1282</podcast:episode>
      <itunes:title>Diffusion Transformers with Representation Autoencoders</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">259bf2bd-e33a-4f68-b1a6-86dff0b58f85</guid>
      <link>https://share.transistor.fm/s/d6d7778f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Diffusion Transformers with Representation Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11690v1">http://arxiv.org/abs/2510.11690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Diffusion Transformers with Representation Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11690v1">http://arxiv.org/abs/2510.11690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:07:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6d7778f/afacc2cb.mp3" length="23545807" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Diffusion Transformers with Representation Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11690v1">http://arxiv.org/abs/2510.11690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256x256 (no guidance) and 1.13 at both 256x256 and 512x512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs</title>
      <itunes:episode>1281</itunes:episode>
      <podcast:episode>1281</podcast:episode>
      <itunes:title>OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">380c5e00-3ae1-4da9-a8ed-c9118f77eb9d</guid>
      <link>https://share.transistor.fm/s/fb010bb4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10689v1">http://arxiv.org/abs/2510.10689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10689v1">http://arxiv.org/abs/2510.10689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:07:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb010bb4/162409b1.mp3" length="25753906" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1606</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu</p>

            <p><strong>Title:</strong><br>
            OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10689v1">http://arxiv.org/abs/2510.10689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting either one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States</title>
      <itunes:episode>1280</itunes:episode>
      <podcast:episode>1280</podcast:episode>
      <itunes:title>Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37ab720f-da87-4ed2-bc37-0fcac37ea1fd</guid>
      <link>https://share.transistor.fm/s/61876efd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui</p>

            <p><strong>Title:</strong><br>
            Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11052v1">http://arxiv.org/abs/2510.11052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui</p>

            <p><strong>Title:</strong><br>
            Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11052v1">http://arxiv.org/abs/2510.11052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:07:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/61876efd/a9966195.mp3" length="24233390" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1511</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui</p>

            <p><strong>Title:</strong><br>
            Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11052v1">http://arxiv.org/abs/2510.11052v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LlaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Spotlight on Token Perception for Multimodal Reinforcement Learning</title>
      <itunes:episode>1279</itunes:episode>
      <podcast:episode>1279</podcast:episode>
      <itunes:title>Spotlight on Token Perception for Multimodal Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a85e9dea-5cad-442b-9d33-5da16d2ba433</guid>
      <link>https://share.transistor.fm/s/c5b1fd4d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Spotlight on Token Perception for Multimodal Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09285v1">http://arxiv.org/abs/2510.09285v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Spotlight on Token Perception for Multimodal Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09285v1">http://arxiv.org/abs/2510.09285v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:06:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5b1fd4d/837dd5c2.mp3" length="22975722" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1432</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Spotlight on Token Perception for Multimodal Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09285v1">http://arxiv.org/abs/2510.09285v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RLFR: Extending Reinforcement Learning for LLMs with Flow Environment</title>
      <itunes:episode>1278</itunes:episode>
      <podcast:episode>1278</podcast:episode>
      <itunes:title>RLFR: Extending Reinforcement Learning for LLMs with Flow Environment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bbb96312-514b-4edd-a4fe-168b6cb7fd90</guid>
      <link>https://share.transistor.fm/s/8fe17a0e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            RLFR: Extending Reinforcement Learning for LLMs with Flow Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10201v1">http://arxiv.org/abs/2510.10201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals, such as entropy and likelihood collected from logit space, for reward shaping of process tokens. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space and propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains much underexplored. Moreover, RLFR can compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            RLFR: Extending Reinforcement Learning for LLMs with Flow Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10201v1">http://arxiv.org/abs/2510.10201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals, such as entropy and likelihood collected from logit space, for reward shaping of process tokens. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space and propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains much underexplored. Moreover, RLFR can compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:06:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8fe17a0e/cb25ba50.mp3" length="23115741" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            RLFR: Extending Reinforcement Learning for LLMs with Flow Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10201v1">http://arxiv.org/abs/2510.10201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals, such as entropy and likelihood collected from logit space, for reward shaping of process tokens. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space and propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains much underexplored. Moreover, RLFR can compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training</title>
      <itunes:episode>1277</itunes:episode>
      <podcast:episode>1277</podcast:episode>
      <itunes:title>DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37c54437-53ed-4059-ae90-5b8541b731b4</guid>
      <link>https://share.transistor.fm/s/32b858dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi</p>

            <p><strong>Title:</strong><br>
            DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11712v1">http://arxiv.org/abs/2510.11712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism in generation quality mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi</p>

            <p><strong>Title:</strong><br>
            DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11712v1">http://arxiv.org/abs/2510.11712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism in generation quality mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:05:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/32b858dd/64530523.mp3" length="21393330" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi</p>

            <p><strong>Title:</strong><br>
            DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11712v1">http://arxiv.org/abs/2510.11712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism in generation quality mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 has several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules, including circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration</title>
      <itunes:episode>1276</itunes:episode>
      <podcast:episode>1276</podcast:episode>
      <itunes:title>AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bb263432-7277-4289-823e-ab3fd051c9e6</guid>
      <link>https://share.transistor.fm/s/05857dcf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10395v1">http://arxiv.org/abs/2510.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10395v1">http://arxiv.org/abs/2510.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:05:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/05857dcf/3ab684bd.mp3" length="23485220" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10395v1">http://arxiv.org/abs/2510.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models</title>
      <itunes:episode>1275</itunes:episode>
      <podcast:episode>1275</podcast:episode>
      <itunes:title>InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">71d94103-6509-4468-8ca6-b0a7edc345cc</guid>
      <link>https://share.transistor.fm/s/a0d9316c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang</p>

            <p><strong>Title:</strong><br>
            InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11341v1">http://arxiv.org/abs/2510.11341v1</a></p>

            <p><strong>Abstract:</strong><br>
            General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang</p>

            <p><strong>Title:</strong><br>
            InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11341v1">http://arxiv.org/abs/2510.11341v1</a></p>

            <p><strong>Abstract:</strong><br>
            General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:05:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a0d9316c/c7dbf3d2.mp3" length="24388431" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1521</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang</p>

            <p><strong>Title:</strong><br>
            InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.11341v1">http://arxiv.org/abs/2510.11341v1</a></p>

            <p><strong>Abstract:</strong><br>
            General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions</title>
      <itunes:episode>1274</itunes:episode>
      <podcast:episode>1274</podcast:episode>
      <itunes:title>BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05a7edf3-4126-4129-8bcf-c5c77c15c7b3</guid>
      <link>https://share.transistor.fm/s/1eac75ed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10666v2">http://arxiv.org/abs/2510.10666v2</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.</p>
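
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of the "predefined browser actions over raw web pages" idea described in the abstract, using the Playwright Python API. The action names, the dispatch logic, and the truncated page-text observation are illustrative assumptions, not BrowserAgent's actual interface; only the Playwright calls themselves (goto, click, fill, mouse.wheel, inner_text) are real API.</p>

            <pre><code>
# Hypothetical sketch: a small, fixed browser-action set executed with Playwright.
# Action names and the observation format are assumptions, not the paper's interface.
from playwright.sync_api import sync_playwright

def run_action(page, name, arg=None):
    if name == "goto":
        page.goto(arg)                      # navigate to a URL
    elif name == "click":
        page.click(arg)                     # arg is a CSS selector
    elif name == "type":
        selector, text = arg
        page.fill(selector, text)           # type into an input field
    elif name == "scroll":
        page.mouse.wheel(0, arg)            # scroll vertically by arg pixels
    return page.inner_text("body")[:2000]   # truncated text observation for the agent

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    obs = run_action(page, "goto", "https://en.wikipedia.org")
    obs = run_action(page, "scroll", 800)
    browser.close()
            </code></pre>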
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10666v2">http://arxiv.org/abs/2510.10666v2</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Oct 2025 21:04:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1eac75ed/eb33b4f2.mp3" length="20824493" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.10666v2">http://arxiv.org/abs/2510.10666v2</a></p>

            <p><strong>Abstract:</strong><br>
            Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI</title>
      <itunes:episode>1273</itunes:episode>
      <podcast:episode>1273</podcast:episode>
      <itunes:title>D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">731cf116-7605-4524-82fc-9f79439ab429</guid>
      <link>https://share.transistor.fm/s/654f71e9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee</p>

            <p><strong>Title:</strong><br>
            D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05684v1">http://arxiv.org/abs/2510.05684v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, human-collected and pseudo-labeled datasets, and VAPT-trained models, available at https://worv-ai.github.io/d2e/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee</p>

            <p><strong>Title:</strong><br>
            D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05684v1">http://arxiv.org/abs/2510.05684v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, human-collected and pseudo-labeled datasets, and VAPT-trained models, available at https://worv-ai.github.io/d2e/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:05:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/654f71e9/234b3ed2.mp3" length="22927672" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1429</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 104 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee</p>

            <p><strong>Title:</strong><br>
            D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05684v1">http://arxiv.org/abs/2510.05684v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, human-collected and pseudo-labeled datasets, and VAPT-trained models, available at https://worv-ai.github.io/d2e/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation</title>
      <itunes:episode>1272</itunes:episode>
      <podcast:episode>1272</podcast:episode>
      <itunes:title>Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9fe08c57-2106-47e1-ab73-da56a1e06f5d</guid>
      <link>https://share.transistor.fm/s/f5dd3b5f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08673v1">http://arxiv.org/abs/2510.08673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08673v1">http://arxiv.org/abs/2510.08673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:04:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f5dd3b5f/e1f28ab4.mp3" length="22421955" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1398</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08673v1">http://arxiv.org/abs/2510.08673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling</title>
      <itunes:episode>1271</itunes:episode>
      <podcast:episode>1271</podcast:episode>
      <itunes:title>TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d27240ee-a2ce-49a1-85f6-96cbace732b8</guid>
      <link>https://share.transistor.fm/s/9d464809</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin</p>

            <p><strong>Title:</strong><br>
            TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04533v1">http://arxiv.org/abs/2510.04533v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.</p>
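
            <p><strong>Code sketch:</strong><br>
            The tangential-amplification step described above can be illustrated with a small numpy sketch: decompose the estimated score into components parallel and tangential to the intermediate sample used as the projection basis, then scale up the tangential part. The amplification factor and the exact update rule here are illustrative assumptions, not the paper's formula.</p>

            <pre><code>
# Hypothetical numpy sketch of tangential amplification against an intermediate
# sample x_t used as the projection basis. The scale value is an assumption.
import numpy as np

def tangential_amplify(score, x_t, scale=1.5):
    basis = x_t.flatten()
    basis = basis / (np.linalg.norm(basis) + 1e-8)   # unit projection basis
    s = score.flatten()
    parallel = np.dot(s, basis) * basis               # component along x_t
    tangential = s - parallel                         # component orthogonal to x_t
    guided = parallel + scale * tangential            # amplify the tangential part
    return guided.reshape(score.shape)

x_t = np.random.randn(3, 64, 64)      # intermediate diffusion sample
score = np.random.randn(3, 64, 64)    # model's estimated score at x_t
guided_score = tangential_amplify(score, x_t)
            </code></pre>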
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin</p>

            <p><strong>Title:</strong><br>
            TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04533v1">http://arxiv.org/abs/2510.04533v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:04:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9d464809/069a2f62.mp3" length="20848324" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin</p>

            <p><strong>Title:</strong><br>
            TAG:Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04533v1">http://arxiv.org/abs/2510.04533v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent diffusion models achieve the state-of-the-art performance in image generation, but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural modifications, which introduces additional computational overhead. In this paper, we propose Tangential Amplifying Guidance (TAG), a more efficient and direct guidance method that operates solely on trajectory signals without modifying the underlying diffusion model. TAG leverages an intermediate sample as a projection basis and amplifies the tangential components of the estimated scores with respect to this basis to correct the sampling trajectory. We formalize this guidance process by leveraging a first-order Taylor expansion, which demonstrates that amplifying the tangential component steers the state toward higher-probability regions, thereby reducing inconsistencies and enhancing sample quality. TAG is a plug-and-play, architecture-agnostic module that improves diffusion sampling fidelity with minimal computational addition, offering a new perspective on diffusion guidance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AutoPR: Let's Automate Your Academic Promotion!</title>
      <itunes:episode>1270</itunes:episode>
      <podcast:episode>1270</podcast:episode>
      <itunes:title>AutoPR: Let's Automate Your Academic Promotion!</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f55313e4-1f76-45b9-8752-dd2a33bc1812</guid>
      <link>https://share.transistor.fm/s/a2155943</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che</p>

            <p><strong>Title:</strong><br>
            AutoPR: Let's Automate Your Academic Promotion!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09558v1">http://arxiv.org/abs/2510.09558v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che</p>

            <p><strong>Title:</strong><br>
            AutoPR: Let's Automate Your Academic Promotion!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09558v1">http://arxiv.org/abs/2510.09558v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:03:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a2155943/1da3d0b2.mp3" length="21736453" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1355</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che</p>

            <p><strong>Title:</strong><br>
            AutoPR: Let's Automate Your Academic Promotion!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09558v1">http://arxiv.org/abs/2510.09558v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs</title>
      <itunes:episode>1269</itunes:episode>
      <podcast:episode>1269</podcast:episode>
      <itunes:title>Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eef5ceb2-3703-407d-ae55-2364a0ff95ef</guid>
      <link>https://share.transistor.fm/s/5286931e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09201v1">http://arxiv.org/abs/2510.09201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09201v1">http://arxiv.org/abs/2510.09201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:03:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5286931e/72355ccc.mp3" length="24645898" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1537</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09201v1">http://arxiv.org/abs/2510.09201v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities</title>
      <itunes:episode>1268</itunes:episode>
      <podcast:episode>1268</podcast:episode>
      <itunes:title>BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e56e819f-e388-4d8e-954e-f4f730aad6ec</guid>
      <link>https://share.transistor.fm/s/6310e743</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong</p>

            <p><strong>Title:</strong><br>
            BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08759v1">http://arxiv.org/abs/2510.08759v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong</p>

            <p><strong>Title:</strong><br>
            BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08759v1">http://arxiv.org/abs/2510.08759v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:03:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6310e743/0f23789a.mp3" length="25661971" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1600</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong</p>

            <p><strong>Title:</strong><br>
            BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08759v1">http://arxiv.org/abs/2510.08759v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StreamingVLM: Real-Time Understanding for Infinite Video Streams</title>
      <itunes:episode>1267</itunes:episode>
      <podcast:episode>1267</podcast:episode>
      <itunes:title>StreamingVLM: Real-Time Understanding for Infinite Video Streams</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bc4007c8-5159-4c4d-a7d9-5c7608f0cda5</guid>
      <link>https://share.transistor.fm/s/8236ecf0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            StreamingVLM: Real-Time Understanding for Infinite Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09608v1">http://arxiv.org/abs/2510.09608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            StreamingVLM: Real-Time Understanding for Infinite Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09608v1">http://arxiv.org/abs/2510.09608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:02:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8236ecf0/e37427e1.mp3" length="20563677" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            StreamingVLM: Real-Time Understanding for Infinite Video Streams</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.09608v1">http://arxiv.org/abs/2510.09608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels</title>
      <itunes:episode>1266</itunes:episode>
      <podcast:episode>1266</podcast:episode>
      <itunes:title>Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bb0f041d-7528-4650-bc3d-c2993464812a</guid>
      <link>https://share.transistor.fm/s/82e43f05</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06499v1">http://arxiv.org/abs/2510.06499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06499v1">http://arxiv.org/abs/2510.06499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:02:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/82e43f05/7e1e81fb.mp3" length="23754809" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1481</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06499v1">http://arxiv.org/abs/2510.06499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution</title>
      <itunes:episode>1265</itunes:episode>
      <podcast:episode>1265</podcast:episode>
      <itunes:title>BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef621135-11ff-49dd-9b6c-c06c163e8b3b</guid>
      <link>https://share.transistor.fm/s/13a288c2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra</p>

            <p><strong>Title:</strong><br>
            BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08697v1">http://arxiv.org/abs/2510.08697v1</a></p>

            <p><strong>Abstract:</strong><br>
            Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra</p>

            <p><strong>Title:</strong><br>
            BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08697v1">http://arxiv.org/abs/2510.08697v1</a></p>

            <p><strong>Abstract:</strong><br>
            Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:02:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13a288c2/b8d2112c.mp3" length="22126032" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra</p>

            <p><strong>Title:</strong><br>
            BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08697v1">http://arxiv.org/abs/2510.08697v1</a></p>

            <p><strong>Abstract:</strong><br>
            Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, as it requires understanding long chunks of raw code and deliberately simulating code execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive and on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena enables the execution of LLM-generated code and allows humans to interact with the execution process and outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data, namely BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs have superior performance in judging coding preferences when the execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs like GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recent emerging models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?</title>
      <itunes:episode>1264</itunes:episode>
      <podcast:episode>1264</podcast:episode>
      <itunes:title>R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43c0895e-b67c-4985-82ab-9dfe7fcd73bf</guid>
      <link>https://share.transistor.fm/s/aa0a6a7d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08189v1">http://arxiv.org/abs/2510.08189v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08189v1">http://arxiv.org/abs/2510.08189v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Oct 2025 21:01:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa0a6a7d/f85c8c74.mp3" length="23774874" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1482</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang, Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai</p>

            <p><strong>Title:</strong><br>
            R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08189v1">http://arxiv.org/abs/2510.08189v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent Learning via Early Experience</title>
      <itunes:episode>1263</itunes:episode>
      <podcast:episode>1263</podcast:episode>
      <itunes:title>Agent Learning via Early Experience</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2d7377c0-bb1c-41fa-bd11-cf574f0e03c5</guid>
      <link>https://share.transistor.fm/s/6b6de3c7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu</p>

            <p><strong>Title:</strong><br>
            Agent Learning via Early Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08558v1">http://arxiv.org/abs/2510.08558v1</a></p>

            <p><strong>Abstract:</strong><br>
            A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu</p>

            <p><strong>Title:</strong><br>
            Agent Learning via Early Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08558v1">http://arxiv.org/abs/2510.08558v1</a></p>

            <p><strong>Abstract:</strong><br>
            A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:02:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b6de3c7/ad46f7f8.mp3" length="21901953" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu</p>

            <p><strong>Title:</strong><br>
            Agent Learning via Early Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08558v1">http://arxiv.org/abs/2510.08558v1</a></p>

            <p><strong>Abstract:</strong><br>
            A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization</title>
      <itunes:episode>1262</itunes:episode>
      <podcast:episode>1262</podcast:episode>
      <itunes:title>MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d751f1a5-b415-458e-aae1-b0d4037b3003</guid>
      <link>https://share.transistor.fm/s/5f50fad4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang</p>

            <p><strong>Title:</strong><br>
            MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08540v1">http://arxiv.org/abs/2510.08540v1</a></p>

            <p><strong>Abstract:</strong><br>
            While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematical and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang</p>

            <p><strong>Title:</strong><br>
            MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08540v1">http://arxiv.org/abs/2510.08540v1</a></p>

            <p><strong>Abstract:</strong><br>
            While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematical and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:02:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f50fad4/c67c2af5.mp3" length="20789853" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1296</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 92 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang</p>

            <p><strong>Title:</strong><br>
            MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08540v1">http://arxiv.org/abs/2510.08540v1</a></p>

            <p><strong>Abstract:</strong><br>
            While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematical and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MemMamba: Rethinking Memory Patterns in State Space Model</title>
      <itunes:episode>1261</itunes:episode>
      <podcast:episode>1261</podcast:episode>
      <itunes:title>MemMamba: Rethinking Memory Patterns in State Space Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">af846694-fcf9-42cb-8dd6-5d2cfc07e14f</guid>
      <link>https://share.transistor.fm/s/43220778</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            MemMamba: Rethinking Memory Patterns in State Space Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03279v1">http://arxiv.org/abs/2510.03279v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates a state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            MemMamba: Rethinking Memory Patterns in State Space Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03279v1">http://arxiv.org/abs/2510.03279v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates a state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:02:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/43220778/cd3f69be.mp3" length="23208516" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1447</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun</p>

            <p><strong>Title:</strong><br>
            MemMamba: Rethinking Memory Patterns in State Space Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03279v1">http://arxiv.org/abs/2510.03279v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from gradient vanishing and explosion, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates a state summarization mechanism together with cross-layer and cross-token attention, which alleviates long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniVideo: Unified Understanding, Generation, and Editing for Videos</title>
      <itunes:episode>1260</itunes:episode>
      <podcast:episode>1260</podcast:episode>
      <itunes:title>UniVideo: Unified Understanding, Generation, and Editing for Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">873075b2-7be0-48a9-866f-0d9c8323984f</guid>
      <link>https://share.transistor.fm/s/5eca73c9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            UniVideo: Unified Understanding, Generation, and Editing for Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08377v1">http://arxiv.org/abs/2510.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            UniVideo: Unified Understanding, Generation, and Editing for Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08377v1">http://arxiv.org/abs/2510.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.</p>
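
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of the dual-stream layout described above: one stream encodes the multimodal instruction into conditioning embeddings, the other denoises video latents under that conditioning via cross-attention. Module names, sizes, and the attention wiring are illustrative assumptions, not the UniVideo implementation.</p>

            <pre><code># Toy dual-stream skeleton: instruction understanding feeds video generation.
import torch
import torch.nn as nn

class InstructionStream(nn.Module):            # stands in for the MLLM
    def __init__(self, d_cond=128):
        super().__init__()
        self.proj = nn.Linear(768, d_cond)
    def forward(self, instruction_tokens):
        return self.proj(instruction_tokens)   # (batch, seq, d_cond) conditioning

class GenerationStream(nn.Module):             # stands in for the MMDiT
    def __init__(self, d_latent=64, d_cond=128):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_latent, num_heads=4,
                                                kdim=d_cond, vdim=d_cond,
                                                batch_first=True)
        self.denoise = nn.Linear(d_latent, d_latent)
    def forward(self, noisy_latents, cond):
        mixed, _ = self.cross_attn(noisy_latents, cond, cond)
        return self.denoise(noisy_latents + mixed)   # predicted denoised latents

mllm, mmdit = InstructionStream(), GenerationStream()
cond = mllm(torch.randn(1, 16, 768))                 # instruction embedding
video_latents = torch.randn(1, 40, 64)               # (batch, frames*patches, dim)
print(mmdit(video_latents, cond).shape)              # torch.Size([1, 40, 64])
</code></pre>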
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:01:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5eca73c9/c6348bf4.mp3" length="25120271" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1566</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            UniVideo: Unified Understanding, Generation, and Editing for Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08377v1">http://arxiv.org/abs/2510.08377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning</title>
      <itunes:episode>1259</itunes:episode>
      <podcast:episode>1259</podcast:episode>
      <itunes:title>From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">877db65b-a23a-406e-9140-5758423f031f</guid>
      <link>https://share.transistor.fm/s/a27f73a4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin</p>

            <p><strong>Title:</strong><br>
            From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23768v1">http://arxiv.org/abs/2509.23768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chemical reaction condition recommendation is the task of selecting proper reaction condition parameters for chemical reactions, and it is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, establishing a new paradigm for explainable AI in scientific discovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin</p>

            <p><strong>Title:</strong><br>
            From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23768v1">http://arxiv.org/abs/2509.23768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chemical reaction condition recommendation is the task of selecting proper reaction condition parameters for chemical reactions, and it is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, establishing a new paradigm for explainable AI in scientific discovery.</p>
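
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical skeleton of the four-stage pipeline named above (mechanistic grounding, multi-channel recall, constraint-aware debate, rationale aggregation). Every function body here is a stub with made-up data; the real system would back each stage with LLM agents and precedent retrieval.</p>

            <pre><code># Toy four-stage pipeline for evidence-based condition recommendation.
def mechanistic_grounding(reaction):
    return {"mechanism": "SN2 (placeholder)", "reaction": reaction}

def multi_channel_recall(grounded):
    return ["precedent-A", "precedent-B"]          # retrieved precedents (stub)

def agentic_debate(grounded, precedents, constraints):
    candidates = [{"solvent": "DMF", "temp_C": 80, "evidence": precedents}]
    # keep only candidates that satisfy user constraints
    return [c for c in candidates if c["temp_C"] in constraints["temp_range"]]

def aggregate_rationale(candidates):
    best = candidates[0]
    best["rationale"] = "Chosen for consistency with retrieved precedents."
    return best

def recommend_conditions(reaction, constraints):
    grounded = mechanistic_grounding(reaction)
    precedents = multi_channel_recall(grounded)
    survivors = agentic_debate(grounded, precedents, constraints)
    return aggregate_rationale(survivors)

print(recommend_conditions("aryl halide + amine",
                           {"temp_range": range(20, 121)}))
</code></pre>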
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:01:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a27f73a4/7bc77c32.mp3" length="25379434" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1583</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin</p>

            <p><strong>Title:</strong><br>
            From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23768v1">http://arxiv.org/abs/2509.23768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chemical reaction condition recommendation is the task of selecting proper reaction condition parameters for chemical reactions, and it is pivotal to accelerating chemical science. With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation. Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20-35% gains over domain-specific baselines and outperforms general-purpose LLMs by 10-15% in Top-1 accuracy, while offering falsifiable, human-trustable rationales, establishing a new paradigm for explainable AI in scientific discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs</title>
      <itunes:episode>1258</itunes:episode>
      <podcast:episode>1258</podcast:episode>
      <itunes:title>When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f3a9b56-baf9-4254-9565-c6782aa3bb97</guid>
      <link>https://share.transistor.fm/s/5a085e0a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07499v1">http://arxiv.org/abs/2510.07499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, all necessary information directly. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches derived from prior problem-solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating their broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07499v1">http://arxiv.org/abs/2510.07499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, all necessary information directly. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches derived from prior problem-solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating their broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).</p>
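
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of how reusable thought templates could be retrieved and prepended to a long-context prompt. The template store, scoring function, and prompt layout are illustrative assumptions, not the paper's method.</p>

            <pre><code># Toy template retrieval and prompt assembly for multi-hop questions.
from collections import Counter

TEMPLATES = {
    "bridge-entity": "1) Find the entity linking the question to doc A. "
                     "2) Use that entity to locate the answer in doc B.",
    "date-compare":  "1) Extract each date. 2) Compare them. 3) Answer.",
}

def overlap_score(question, template_name):
    """Crude lexical-overlap relevance score between question and template name."""
    q_words = Counter(question.lower().split())
    t_words = Counter(template_name.replace("-", " ").split())
    return sum((q_words & t_words).values())

def build_prompt(question, documents, top_k=1):
    ranked = sorted(TEMPLATES, key=lambda n: overlap_score(question, n),
                    reverse=True)
    chosen = [TEMPLATES[name] for name in ranked[:top_k]]
    parts = ["Reasoning templates:"] + chosen + ["Documents:"] + documents
    parts.append("Question: " + question)
    return "\n\n".join(parts)

print(build_prompt("Which bridge entity connects the two articles?",
                   ["doc A text ...", "doc B text ..."]))
</code></pre>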
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:00:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5a085e0a/148e7866.mp3" length="20581232" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07499v1">http://arxiv.org/abs/2510.07499v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, all necessary information directly. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches derived from prior problem-solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating their broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning</title>
      <itunes:episode>1257</itunes:episode>
      <podcast:episode>1257</podcast:episode>
      <itunes:title>Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">82506581-667f-4b78-8bed-6c26bbae9976</guid>
      <link>https://share.transistor.fm/s/fb9dc78b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yoonjeon Kim, Doohyuk Jang, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03259v1">http://arxiv.org/abs/2510.03259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies on reasoning models explore the meta-awareness of language models: the ability of a model to know how it should think on its own. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, achieves a 19.3% gain in accuracy on AIME25, and yields a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yoonjeon Kim, Doohyuk Jang, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03259v1">http://arxiv.org/abs/2510.03259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies on reasoning models explore the meta-awareness of language models: the ability of a model to know how it should think on its own. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, achieves a 19.3% gain in accuracy on AIME25, and yields a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.</p>
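
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of one of the efficiency tricks named above: dropping zero-variance prompts whose rollouts are all correct or all wrong, since they carry no learning signal for a GRPO-style update. The reward encoding and data layout are illustrative assumptions.</p>

            <pre><code># Toy zero-variance prompt filter for group-based RL training.
import numpy as np

def keep_prompt(rollout_rewards):
    """Keep a prompt only if its sampled rollouts disagree on correctness."""
    rewards = np.asarray(rollout_rewards, dtype=float)
    return bool(rewards.var() != 0.0)

batch = {
    "trivial":    [1, 1, 1, 1],     # always solved: filtered out
    "unsolvable": [0, 0, 0, 0],     # never solved: filtered out
    "useful":     [0, 1, 0, 1],     # mixed outcomes: kept for training
}
kept = {name: r for name, r in batch.items() if keep_prompt(r)}
print(kept)   # {'useful': [0, 1, 0, 1]}
</code></pre>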
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:00:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb9dc78b/30a8ba87.mp3" length="23603509" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1472</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yoonjeon Kim, Doohyuk Jang, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03259v1">http://arxiv.org/abs/2510.03259v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies on reasoning models explore the meta-awareness of language models: the ability of a model to know how it should think on its own. We argue that large reasoning models lack this meta-awareness property by proving severe misalignment between true rollouts and predicted meta information. We posit that aligning meta-prediction with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and prove that enhanced meta-awareness directly translates to improved accuracy. Unlike existing meta-cognitive reasoning models, our method does not require external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are inspiring: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. More specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, achieves a 19.3% gain in accuracy on AIME25, and yields a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning</title>
      <itunes:episode>1256</itunes:episode>
      <podcast:episode>1256</podcast:episode>
      <itunes:title>VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5fd425a-d59f-4913-a2c1-6f4b109bd052</guid>
      <link>https://share.transistor.fm/s/e9a6c5d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08555v1">http://arxiv.org/abs/2510.08555v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08555v1">http://arxiv.org/abs/2510.08555v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.</p>
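
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of fractional temporal positioning: map a condition's pixel-frame timestamp to a continuous position in the latent sequence (pixel frames per latent being the causal-VAE compression factor) and evaluate rotary angles at that fractional position. The compression factor, dimensions, and function names are illustrative assumptions.</p>

            <pre><code># Toy fractional position assignment for conditions in a compressed latent timeline.
import numpy as np

def fractional_latent_position(pixel_frame, frames_per_latent=4):
    return pixel_frame / frames_per_latent        # may be non-integer

def rope_angles(position, dim=8, base=10000.0):
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    return position * inv_freq                    # angles for sin/cos embedding

# A condition image dropped at pixel frame 6 sits "between" latent frames 1 and 2.
pos = fractional_latent_position(pixel_frame=6)
print(pos)                    # 1.5
print(np.cos(rope_angles(pos))[:3])
</code></pre>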
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 21:00:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9a6c5d6/49537189.mp3" length="23697992" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1477</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08555v1">http://arxiv.org/abs/2510.08555v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, extension, and interpolation--under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE's temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Alignment Waltz: Jointly Training Agents to Collaborate for Safety</title>
      <itunes:episode>1255</itunes:episode>
      <podcast:episode>1255</podcast:episode>
      <itunes:title>The Alignment Waltz: Jointly Training Agents to Collaborate for Safety</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">600dd4ad-54df-474f-8608-7c4b72a9500b</guid>
      <link>https://share.transistor.fm/s/676de22a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan</p>

            <p><strong>Title:</strong><br>
            The Alignment Waltz: Jointly Training Agents to Collaborate for Safety</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08240v1">http://arxiv.org/abs/2510.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely--it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan</p>

            <p><strong>Title:</strong><br>
            The Alignment Waltz: Jointly Training Agents to Collaborate for Safety</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08240v1">http://arxiv.org/abs/2510.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely--it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.</p>
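
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of an improvement-style reward: score the conversation agent's response before and after it incorporates feedback and pay the feedback agent in proportion to the gain. The judge stub and weighting are placeholders, not the paper's exact DIR formulation.</p>

            <pre><code># Toy improvement reward for a feedback agent.
def judge(response):
    """Stub for a safety-and-helpfulness scorer in [0, 1]."""
    scores = {"refuse everything": 0.25, "nuanced safe answer": 0.75}
    return scores.get(response, 0.5)

def improvement_reward(before, after, incorporation_weight=1.0):
    improvement = judge(after) - judge(before)
    return incorporation_weight * improvement

r = improvement_reward(before="refuse everything",
                       after="nuanced safe answer")
print(r)   # 0.5 with these placeholder scores
</code></pre>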
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 20:59:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/676de22a/302fc24d.mp3" length="23417090" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan</p>

            <p><strong>Title:</strong><br>
            The Alignment Waltz: Jointly Training Agents to Collaborate for Safety</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.08240v1">http://arxiv.org/abs/2510.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely--it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense</title>
      <itunes:episode>1254</itunes:episode>
      <podcast:episode>1254</podcast:episode>
      <itunes:title>Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">087c1399-50a9-4a8a-8a2b-8eaa914bd799</guid>
      <link>https://share.transistor.fm/s/0c357348</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu</p>

            <p><strong>Title:</strong><br>
            Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07242v2">http://arxiv.org/abs/2510.07242v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu</p>

            <p><strong>Title:</strong><br>
            Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07242v2">http://arxiv.org/abs/2510.07242v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.</p>
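
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of stratified normalization and variance-aware weighting: reward-model scores are squashed into verifier-defined bands so an incorrect answer can never out-score a correct one, and prompts whose rollouts show high reward variance are up-weighted. The score ranges and weighting rule are illustrative assumptions, not the paper's exact formulation.</p>

            <pre><code># Toy hybrid reward: verifier-defined strata plus variance-aware weighting.
import numpy as np

def stratified_reward(rm_scores, verifier_correct):
    rm = np.asarray(rm_scores, dtype=float)
    ok = np.asarray(verifier_correct, dtype=bool)
    norm = (rm - rm.min()) / (rm.max() - rm.min() + 1e-8)   # squash RM scores to [0, 1]
    # correct rollouts live in [0.5, 1.0], incorrect ones in [0.0, 0.5]
    return np.where(ok, 0.5 + 0.5 * norm, 0.5 * norm)

def variance_weight(rewards, alpha=1.0):
    return 1.0 + alpha * np.var(rewards)                    # emphasize hard prompts

rewards = stratified_reward(rm_scores=[0.2, 0.9, 0.4, 0.7],
                            verifier_correct=[False, True, False, True])
print(rewards)                 # approximately [0.  1.  0.14  0.86]
print(variance_weight(rewards))
</code></pre>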
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Oct 2025 20:59:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c357348/69e75cd8.mp3" length="23392429" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1458</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu</p>

            <p><strong>Title:</strong><br>
            Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07242v2">http://arxiv.org/abs/2510.07242v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cache-to-Cache: Direct Semantic Communication Between Large Language Models</title>
      <itunes:episode>1253</itunes:episode>
      <podcast:episode>1253</podcast:episode>
      <itunes:title>Cache-to-Cache: Direct Semantic Communication Between Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5bd9e5c3-5232-422c-afed-6fc8f178995b</guid>
      <link>https://share.transistor.fm/s/aa2ab859</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.LG, 68T07, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Cache-to-Cache: Direct Semantic Communication Between Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03215v1">http://arxiv.org/abs/2510.03215v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.LG, 68T07, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Cache-to-Cache: Direct Semantic Communication Between Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03215v1">http://arxiv.org/abs/2510.03215v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.</p>
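
            <p><strong>Code sketch:</strong><br>
            A minimal, hypothetical sketch of cache fusion: project the source model's per-layer key/value tensors into the target model's dimensions and blend them through a learnable gate. Shapes, the fuser design, and the gating form are illustrative assumptions, not the released C2C code.</p>

            <pre><code># Toy per-layer KV-cache fuser with a learnable gate.
import torch
import torch.nn as nn

class CacheFuser(nn.Module):
    def __init__(self, d_src, d_tgt):
        super().__init__()
        self.proj_k = nn.Linear(d_src, d_tgt)
        self.proj_v = nn.Linear(d_src, d_tgt)
        self.gate = nn.Parameter(torch.zeros(1))      # per-layer learnable gate

    def forward(self, src_k, src_v, tgt_k, tgt_v):
        g = torch.sigmoid(self.gate)                  # mixing coefficient in (0, 1)
        fused_k = tgt_k + g * self.proj_k(src_k)
        fused_v = tgt_v + g * self.proj_v(src_v)
        return fused_k, fused_v

fuser = CacheFuser(d_src=96, d_tgt=64)
src_k = src_v = torch.randn(1, 10, 96)                # (batch, seq, d_src)
tgt_k = tgt_v = torch.randn(1, 10, 64)                # (batch, seq, d_tgt)
k, v = fuser(src_k, src_v, tgt_k, tgt_v)
print(k.shape, v.shape)                               # torch.Size([1, 10, 64]) each
</code></pre>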
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:41:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa2ab859/cbc4ca13.mp3" length="24038601" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1499</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.LG, 68T07, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Cache-to-Cache: Direct Semantic Communication Between Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03215v1">http://arxiv.org/abs/2510.03215v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer</title>
      <itunes:episode>1252</itunes:episode>
      <podcast:episode>1252</podcast:episode>
      <itunes:title>Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">acbae27f-50f0-4c10-8bc8-b90f010531dc</guid>
      <link>https://share.transistor.fm/s/38a1e662</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06590v1">http://arxiv.org/abs/2510.06590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06590v1">http://arxiv.org/abs/2510.06590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:40:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38a1e662/e008d878.mp3" length="25774820" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1607</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou</p>

            <p><strong>Title:</strong><br>
            Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06590v1">http://arxiv.org/abs/2510.06590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding</title>
      <itunes:episode>1251</itunes:episode>
      <podcast:episode>1251</podcast:episode>
      <itunes:title>Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b687184-ec87-468f-a05e-61aacae718a1</guid>
      <link>https://share.transistor.fm/s/d9b6313c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06308v1">http://arxiv.org/abs/2510.06308v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06308v1">http://arxiv.org/abs/2510.06308v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:40:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d9b6313c/fb5cbbed.mp3" length="20465907" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu</p>

            <p><strong>Title:</strong><br>
            Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06308v1">http://arxiv.org/abs/2510.06308v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models</title>
      <itunes:episode>1250</itunes:episode>
      <podcast:episode>1250</podcast:episode>
      <itunes:title>SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12e025a1-d961-4d25-969a-d030bf8feabb</guid>
      <link>https://share.transistor.fm/s/74f6e6d0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06917v1">http://arxiv.org/abs/2510.06917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06917v1">http://arxiv.org/abs/2510.06917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:39:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/74f6e6d0/7dfd5337.mp3" length="23638607" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1474</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06917v1">http://arxiv.org/abs/2510.06917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MATRIX: Mask Track Alignment for Interaction-aware Video Generation</title>
      <itunes:episode>1249</itunes:episode>
      <podcast:episode>1249</podcast:episode>
      <itunes:title>MATRIX: Mask Track Alignment for Interaction-aware Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0de1e07d-55b4-4956-b074-01211c1e2573</guid>
      <link>https://share.transistor.fm/s/829afe56</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            MATRIX: Mask Track Alignment for Interaction-aware Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07310v1">http://arxiv.org/abs/2510.07310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            MATRIX: Mask Track Alignment for Interaction-aware Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07310v1">http://arxiv.org/abs/2510.07310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:39:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/829afe56/a404d567.mp3" length="22208349" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            MATRIX: Mask Track Alignment for Interaction-aware Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07310v1">http://arxiv.org/abs/2510.07310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training</title>
      <itunes:episode>1248</itunes:episode>
      <podcast:episode>1248</podcast:episode>
      <itunes:title>RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a31f519d-24cc-4b3c-8c65-49d74a0a859c</guid>
      <link>https://share.transistor.fm/s/78e46c4d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06710v1">http://arxiv.org/abs/2510.06710v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06710v1">http://arxiv.org/abs/2510.06710v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:39:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78e46c4d/c59d4d30.mp3" length="19547618" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1218</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06710v1">http://arxiv.org/abs/2510.06710v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vibe Checker: Aligning Code Evaluation with Human Preference</title>
      <itunes:episode>1247</itunes:episode>
      <podcast:episode>1247</podcast:episode>
      <itunes:title>Vibe Checker: Aligning Code Evaluation with Human Preference</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8cd6345-0a75-471b-bd4a-5f69aadd289a</guid>
      <link>https://share.transistor.fm/s/da9980d9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun</p>

            <p><strong>Title:</strong><br>
            Vibe Checker: Aligning Code Evaluation with Human Preference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07315v1">http://arxiv.org/abs/2510.07315v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun</p>

            <p><strong>Title:</strong><br>
            Vibe Checker: Aligning Code Evaluation with Human Preference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07315v1">http://arxiv.org/abs/2510.07315v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Oct 2025 20:38:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/da9980d9/0ba53872.mp3" length="22756287" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1419</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun</p>

            <p><strong>Title:</strong><br>
            Vibe Checker: Aligning Code Evaluation with Human Preference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.07315v1">http://arxiv.org/abs/2510.07315v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Less is More: Recursive Reasoning with Tiny Networks</title>
      <itunes:episode>1246</itunes:episode>
      <podcast:episode>1246</podcast:episode>
      <itunes:title>Less is More: Recursive Reasoning with Tiny Networks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b932f19a-2c12-4d95-9377-5e80b2c08c9f</guid>
      <link>https://share.transistor.fm/s/2b31497d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexia Jolicoeur-Martineau</p>

            <p><strong>Title:</strong><br>
            Less is More: Recursive Reasoning with Tiny Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04871v1">http://arxiv.org/abs/2510.04871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexia Jolicoeur-Martineau</p>

            <p><strong>Title:</strong><br>
            Less is More: Recursive Reasoning with Tiny Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04871v1">http://arxiv.org/abs/2510.04871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:45:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2b31497d/d3567794.mp3" length="20877134" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alexia Jolicoeur-Martineau</p>

            <p><strong>Title:</strong><br>
            Less is More: Recursive Reasoning with Tiny Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04871v1">http://arxiv.org/abs/2510.04871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning</title>
      <itunes:episode>1245</itunes:episode>
      <podcast:episode>1245</podcast:episode>
      <itunes:title>TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe4ec805-3b6c-4828-837e-0b8a3e0e022a</guid>
      <link>https://share.transistor.fm/s/2d79fee4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He</p>

            <p><strong>Title:</strong><br>
            TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06217v1">http://arxiv.org/abs/2510.06217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He</p>

            <p><strong>Title:</strong><br>
            TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06217v1">http://arxiv.org/abs/2510.06217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:44:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2d79fee4/4a1c1012.mp3" length="26322749" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1641</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He</p>

            <p><strong>Title:</strong><br>
            TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.06217v1">http://arxiv.org/abs/2510.06217v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs</title>
      <itunes:episode>1244</itunes:episode>
      <podcast:episode>1244</podcast:episode>
      <itunes:title>Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dccd43d8-3cfd-489b-9741-4bf2389eeeba</guid>
      <link>https://share.transistor.fm/s/ebfecbfe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shreyas Singh, Kunal Singh, Pradeep Moturi</p>

            <p><strong>Title:</strong><br>
            Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24107v1">http://arxiv.org/abs/2509.24107v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shreyas Singh, Kunal Singh, Pradeep Moturi</p>

            <p><strong>Title:</strong><br>
            Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24107v1">http://arxiv.org/abs/2509.24107v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:44:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ebfecbfe/3f2c2aa2.mp3" length="22875433" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1426</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shreyas Singh, Kunal Singh, Pradeep Moturi</p>

            <p><strong>Title:</strong><br>
            Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24107v1">http://arxiv.org/abs/2509.24107v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category while demonstrating strong generalization to diverse reasoning tasks including HLE, AIME-25, GPQA-Diamond, and MedQA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>In-the-Flow Agentic System Optimization for Effective Planning and Tool Use</title>
      <itunes:episode>1243</itunes:episode>
      <podcast:episode>1243</podcast:episode>
      <itunes:title>In-the-Flow Agentic System Optimization for Effective Planning and Tool Use</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9bc4d8ce-4ecd-40df-8b34-ef7c1ea1fcc0</guid>
      <link>https://share.transistor.fm/s/18d06c51</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu</p>

            <p><strong>Title:</strong><br>
            In-the-Flow Agentic System Optimization for Effective Planning and Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05592v1">http://arxiv.org/abs/2510.05592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu</p>

            <p><strong>Title:</strong><br>
            In-the-Flow Agentic System Optimization for Effective Planning and Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05592v1">http://arxiv.org/abs/2510.05592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:44:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18d06c51/868bae21.mp3" length="26693895" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1665</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu</p>

            <p><strong>Title:</strong><br>
            In-the-Flow Agentic System Optimization for Effective Planning and Tool Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05592v1">http://arxiv.org/abs/2510.05592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fast-dLLM v2: Efficient Block-Diffusion LLM</title>
      <itunes:episode>1242</itunes:episode>
      <podcast:episode>1242</podcast:episode>
      <itunes:title>Fast-dLLM v2: Efficient Block-Diffusion LLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be1d972f-f35b-4847-8137-92f7e917e8b5</guid>
      <link>https://share.transistor.fm/s/856ffe60</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            Fast-dLLM v2: Efficient Block-Diffusion LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26328v1">http://arxiv.org/abs/2509.26328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            Fast-dLLM v2: Efficient Block-Diffusion LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26328v1">http://arxiv.org/abs/2509.26328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:43:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/856ffe60/53074a1e.mp3" length="22693158" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            Fast-dLLM v2: Efficient Block-Diffusion LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26328v1">http://arxiv.org/abs/2509.26328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CoDA: Coding LM via Diffusion Adaptation</title>
      <itunes:episode>1241</itunes:episode>
      <podcast:episode>1241</podcast:episode>
      <itunes:title>CoDA: Coding LM via Diffusion Adaptation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03b16480-2040-4a5d-ac39-8f599de30cc8</guid>
      <link>https://share.transistor.fm/s/b48ed83b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            CoDA: Coding LM via Diffusion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03270v1">http://arxiv.org/abs/2510.03270v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            CoDA: Coding LM via Diffusion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03270v1">http://arxiv.org/abs/2510.03270v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:43:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b48ed83b/129e7efa.mp3" length="21376166" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1332</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao</p>

            <p><strong>Title:</strong><br>
            CoDA: Coding LM via Diffusion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03270v1">http://arxiv.org/abs/2510.03270v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Drax: Speech Recognition with Discrete Flow Matching</title>
      <itunes:episode>1240</itunes:episode>
      <podcast:episode>1240</podcast:episode>
      <itunes:title>Drax: Speech Recognition with Discrete Flow Matching</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">74e3c3a2-edd6-462a-b3ef-f1ca69ba86b1</guid>
      <link>https://share.transistor.fm/s/649282a0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | eess.AS, cs.LG, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya</p>

            <p><strong>Title:</strong><br>
            Drax: Speech Recognition with Discrete Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04162v1">http://arxiv.org/abs/2510.04162v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | eess.AS, cs.LG, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya</p>

            <p><strong>Title:</strong><br>
            Drax: Speech Recognition with Discrete Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04162v1">http://arxiv.org/abs/2510.04162v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Oct 2025 20:43:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/649282a0/df8e8cbc.mp3" length="23962091" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | eess.AS, cs.LG, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Aviv Navon, Aviv Shamsian, Neta Glazer, Yael Segal-Feldman, Gill Hetz, Joseph Keshet, Ethan Fetaya</p>

            <p><strong>Title:</strong><br>
            Drax: Speech Recognition with Discrete Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04162v1">http://arxiv.org/abs/2510.04162v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Paper2Video: Automatic Video Generation from Scientific Papers</title>
      <itunes:episode>1239</itunes:episode>
      <podcast:episode>1239</podcast:episode>
      <itunes:title>Paper2Video: Automatic Video Generation from Scientific Papers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37de2565-15d6-4b09-a7f5-cb1051f7ce82</guid>
      <link>https://share.transistor.fm/s/d8128ece</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI, cs.CL, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Paper2Video: Automatic Video Generation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05096v1">http://arxiv.org/abs/2510.05096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation and effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI, cs.CL, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Paper2Video: Automatic Video Generation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05096v1">http://arxiv.org/abs/2510.05096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation and effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:20:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8128ece/f8d9f0f7.mp3" length="21077347" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV, cs.AI, cs.CL, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            Paper2Video: Automatic Video Generation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05096v1">http://arxiv.org/abs/2510.05096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation and effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information</title>
      <itunes:episode>1238</itunes:episode>
      <podcast:episode>1238</podcast:episode>
      <itunes:title>MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e7ed8620-f86e-462f-9bad-0bdd1608e34a</guid>
      <link>https://share.transistor.fm/s/4d3cd2a8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxi Li, Yucheng Shi, Jin Lu, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03632v1">http://arxiv.org/abs/2510.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tree search has become a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performance while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxi Li, Yucheng Shi, Jin Lu, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03632v1">http://arxiv.org/abs/2510.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tree search has become a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performance while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:20:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4d3cd2a8/9a2c7bb0.mp3" length="24555619" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1531</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxi Li, Yucheng Shi, Jin Lu, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03632v1">http://arxiv.org/abs/2510.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tree search has become a representative framework for test-time reasoning with large language models (LLMs), exemplified by methods such as Tree-of-Thought and Monte Carlo Tree Search that explore multiple reasoning paths. However, it remains difficult to provide instant and reliable quantitative assessments of intermediate reasoning step quality, and extensive path exploration is computationally costly. To address this, we propose Mutual Information Tree Search (MITS), a novel framework that guides reasoning with information-theoretic principles. MITS introduces an effective scoring function based on pointwise mutual information (PMI), which enables step-wise evaluation of reasoning paths and search tree expansion via beam search without expensive look-ahead simulations, achieving superior reasoning performance while maintaining computational efficiency. The framework is complemented by an entropy-based dynamic sampling strategy that adaptively allocates computational resources to uncertain reasoning steps where exploration is most beneficial. For final prediction, MITS employs a weighted voting scheme that combines PMI scores with prediction consensus. Through comprehensive experiments on diverse reasoning benchmarks, MITS consistently surpasses baseline methods, establishing a principled and efficient framework for LLM reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models</title>
      <itunes:episode>1237</itunes:episode>
      <podcast:episode>1237</podcast:episode>
      <itunes:title>Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a33e1e3d-b244-49ac-a82e-c11f4294784f</guid>
      <link>https://share.transistor.fm/s/494d7274</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu</p>

            <p><strong>Title:</strong><br>
            Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05034v1">http://arxiv.org/abs/2510.05034v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu</p>

            <p><strong>Title:</strong><br>
            Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05034v1">http://arxiv.org/abs/2510.05034v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:19:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/494d7274/33d83a92.mp3" length="25419549" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1585</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu</p>

            <p><strong>Title:</strong><br>
            Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05034v1">http://arxiv.org/abs/2510.05034v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VChain: Chain-of-Visual-Thought for Reasoning in Video Generation</title>
      <itunes:episode>1236</itunes:episode>
      <podcast:episode>1236</podcast:episode>
      <itunes:title>VChain: Chain-of-Visual-Thought for Reasoning in Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">107d1b96-25d5-4eb9-9c6e-2033b713f798</guid>
      <link>https://share.transistor.fm/s/69e0f2e3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VChain: Chain-of-Visual-Thought for Reasoning in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05094v1">http://arxiv.org/abs/2510.05094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VChain: Chain-of-Visual-Thought for Reasoning in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05094v1">http://arxiv.org/abs/2510.05094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:19:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69e0f2e3/cb597364.mp3" length="21496981" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VChain: Chain-of-Visual-Thought for Reasoning in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05094v1">http://arxiv.org/abs/2510.05094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time tuning of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Imperceptible Jailbreaking against Large Language Models</title>
      <itunes:episode>1235</itunes:episode>
      <podcast:episode>1235</podcast:episode>
      <itunes:title>Imperceptible Jailbreaking against Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">774647a0-c91a-4c2b-82da-29c77cd53fdd</guid>
      <link>https://share.transistor.fm/s/19006676</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Imperceptible Jailbreaking against Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05025v1">http://arxiv.org/abs/2510.05025v1</a></p>

            <p><strong>Abstract:</strong><br>
            Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.</p>
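
            <p><strong>Illustrative code sketch:</strong><br>
            A minimal Python sketch of the invisible-suffix mechanism described above: Unicode variation selectors (U+FE00&#8211;U+FE0F and the supplementary block U+E0100&#8211;U+E01EF) are appended to a prompt so it renders identically on screen while its tokenization changes. The chain-of-search procedure that picks which selectors to append is not reproduced here; the indices and the prompt are illustrative only.</p>

            <pre><code>def append_variation_selectors(text: str, indices: list[int]) -> str:
    """Append invisible variation selectors chosen by `indices`.

    Indices 0-15 map to U+FE00..U+FE0F (VS1-VS16); larger indices map to the
    supplementary block U+E0100..U+E01EF (VS17-VS256). The returned string
    displays the same as `text` but tokenizes differently.
    """
    suffix = "".join(
        chr(0xFE00 + i) if i < 16 else chr(0xE0100 + (i - 16)) for i in indices
    )
    return text + suffix


prompt = "Summarize the attached report."           # placeholder prompt
perturbed = append_variation_selectors(prompt, [3, 17, 42])

print(prompt == perturbed)            # False: the strings differ
print(len(perturbed) - len(prompt))   # 3 invisible code points were appended
print(perturbed)                      # looks identical to the original on screen
</code></pre>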
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Imperceptible Jailbreaking against Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05025v1">http://arxiv.org/abs/2510.05025v1</a></p>

            <p><strong>Abstract:</strong><br>
            Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:19:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/19006676/931ef61d.mp3" length="20041220" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1249</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Imperceptible Jailbreaking against Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.05025v1">http://arxiv.org/abs/2510.05025v1</a></p>

            <p><strong>Abstract:</strong><br>
            Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Our code is available at https://github.com/sail-sg/imperceptible-jailbreaks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models</title>
      <itunes:episode>1234</itunes:episode>
      <podcast:episode>1234</podcast:episode>
      <itunes:title>Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1f90e8be-f023-4bec-bb18-f99f57ed05ae</guid>
      <link>https://share.transistor.fm/s/78cf6320</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun</p>

            <p><strong>Title:</strong><br>
            Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04618v1">http://arxiv.org/abs/2510.04618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.</p>
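
            <p><strong>Illustrative code sketch:</strong><br>
            A minimal Python sketch, under stated assumptions, of the incremental "playbook" update the abstract contrasts with wholesale context rewriting: entries are added or refined individually, so unrelated knowledge is never dropped. The LLM-driven generation, reflection, and curation steps are stubbed out; the class names and fields are illustrative, not ACE's actual schema.</p>

            <pre><code>from dataclasses import dataclass, field


@dataclass
class PlaybookEntry:
    topic: str
    advice: str
    uses: int = 0          # how often this entry helped, per execution feedback


@dataclass
class Playbook:
    entries: dict[str, PlaybookEntry] = field(default_factory=dict)

    def curate(self, topic: str, advice: str) -> None:
        """Incremental update: refine an existing entry or add a new one,
        never discarding unrelated entries (avoids context collapse)."""
        if topic in self.entries:
            self.entries[topic].advice = advice
            self.entries[topic].uses += 1
        else:
            self.entries[topic] = PlaybookEntry(topic, advice)

    def render(self) -> str:
        """Serialize the playbook as the system-prompt context."""
        return "\n".join(f"- [{e.topic}] {e.advice}" for e in self.entries.values())


# Usage: after each rollout, a reflector model would turn execution feedback
# into (topic, advice) pairs; here one pair is hard-coded for illustration.
pb = Playbook()
pb.curate("api-errors", "Retry idempotent calls once before reporting failure.")
pb.curate("finance", "Always state the reporting currency explicitly.")
print(pb.render())
</code></pre>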
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun</p>

            <p><strong>Title:</strong><br>
            Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04618v1">http://arxiv.org/abs/2510.04618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:18:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78cf6320/0caada6a.mp3" length="25613895" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1597</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun</p>

            <p><strong>Title:</strong><br>
            Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04618v1">http://arxiv.org/abs/2510.04618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hybrid Architectures for Language Models: Systematic Analysis and Design Insights</title>
      <itunes:episode>1233</itunes:episode>
      <podcast:episode>1233</podcast:episode>
      <itunes:title>Hybrid Architectures for Language Models: Systematic Analysis and Design Insights</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6c92efc4-5a73-4aa9-8a1a-abf0986c2945</guid>
      <link>https://share.transistor.fm/s/757020f5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu</p>

            <p><strong>Title:</strong><br>
            Hybrid Architectures for Language Models: Systematic Analysis and Design Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04800v1">http://arxiv.org/abs/2510.04800v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.</p>
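
            <p><strong>Illustrative code sketch:</strong><br>
            A toy PyTorch sketch contrasting the two fusion styles named in the abstract: inter-layer (sequential) fusion stacks an attention block and a state-space-style block, while intra-layer (parallel) fusion runs both on the same input and mixes their outputs. The SSMBlock here is a gated-MLP stand-in, not a real Mamba layer, and the dimensions are arbitrary.</p>

            <pre><code>import torch
import torch.nn as nn


class SSMBlock(nn.Module):          # placeholder for a structured state space block
    def __init__(self, d):
        super().__init__()
        self.proj_in, self.proj_out = nn.Linear(d, 2 * d), nn.Linear(d, d)

    def forward(self, x):
        u, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(u * torch.sigmoid(gate))


class AttnBlock(nn.Module):
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class SequentialHybrid(nn.Module):  # inter-layer fusion: alternate block types
    def __init__(self, d):
        super().__init__()
        self.blocks = nn.ModuleList([AttnBlock(d), SSMBlock(d)])

    def forward(self, x):
        for blk in self.blocks:
            x = x + blk(x)          # residual around each block
        return x


class ParallelHybrid(nn.Module):    # intra-layer fusion: mix both paths per layer
    def __init__(self, d):
        super().__init__()
        self.attn, self.ssm = AttnBlock(d), SSMBlock(d)
        self.mix = nn.Linear(2 * d, d)

    def forward(self, x):
        return x + self.mix(torch.cat([self.attn(x), self.ssm(x)], dim=-1))


x = torch.randn(2, 16, 32)          # (batch, sequence, hidden)
print(SequentialHybrid(32)(x).shape, ParallelHybrid(32)(x).shape)
</code></pre>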
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu</p>

            <p><strong>Title:</strong><br>
            Hybrid Architectures for Language Models: Systematic Analysis and Design Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04800v1">http://arxiv.org/abs/2510.04800v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:18:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/757020f5/05a7e9cb.mp3" length="22250995" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu</p>

            <p><strong>Title:</strong><br>
            Hybrid Architectures for Language Models: Systematic Analysis and Design Insights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.04800v1">http://arxiv.org/abs/2510.04800v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses of the key factors behind their effectiveness have not been clearly shared with the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We evaluate these designs from a variety of perspectives: language modeling performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for both hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Optimal Scaling Needs Optimal Norm</title>
      <itunes:episode>1232</itunes:episode>
      <podcast:episode>1232</podcast:episode>
      <itunes:title>Optimal Scaling Needs Optimal Norm</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b71a922d-be18-4541-b733-e36c41b4a368</guid>
      <link>https://share.transistor.fm/s/81e774e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim</p>

            <p><strong>Title:</strong><br>
            Optimal Scaling Needs Optimal Norm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03871v1">http://arxiv.org/abs/2510.03871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.</p>
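
            <p><strong>Illustrative code sketch:</strong><br>
            A short PyTorch sketch of the quantity the abstract centers on: the operator norm (largest singular value) of the output layer's weight matrix, which one could log during training to check the norm-transfer condition. The two-layer model is a toy stand-in, not the paper's setup or the Scion optimizer.</p>

            <pre><code>import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512), nn.GELU(),
    nn.Linear(512, 32000),          # toy "output layer" mapping to a vocabulary
)


def output_operator_norm(layer: nn.Linear) -> float:
    """Largest singular value of the weight matrix (spectral norm)."""
    return torch.linalg.matrix_norm(layer.weight, ord=2).item()


# Logged over training, this value is what the abstract reports as invariant
# at the jointly optimal (learning rate, batch size) pair.
print(f"operator norm of output layer: {output_operator_norm(model[-1]):.3f}")
</code></pre>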
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim</p>

            <p><strong>Title:</strong><br>
            Optimal Scaling Needs Optimal Norm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03871v1">http://arxiv.org/abs/2510.03871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Oct 2025 21:18:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81e774e4/4cf1d83b.mp3" length="22017309" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim</p>

            <p><strong>Title:</strong><br>
            Optimal Scaling Needs Optimal Norm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.03871v1">http://arxiv.org/abs/2510.03871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Apriel-1.5-15b-Thinker</title>
      <itunes:episode>1231</itunes:episode>
      <podcast:episode>1231</podcast:episode>
      <itunes:title>Apriel-1.5-15b-Thinker</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43719bc3-9526-46db-893e-535fb7309c9f</guid>
      <link>https://share.transistor.fm/s/1304cf5f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shruthan Radhakrishna, Aman Tiwari, Aanjaneya Shukla, Masoud Hashemi, Rishabh Maheshwary, Shiva Krishna Reddy Malay, Jash Mehta, Pulkit Pattnaik, Saloni Mittal, Khalil Slimi, Kelechi Ogueji, Akintunde Oladipo, Soham Parikh, Oluwanifemi Bamgbose, Toby Liang, Ahmed Masry, Khyati Mahajan, Sai Rajeswar Mudumba, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sagar Davasam, Srinivas Sunkara, Nicholas Chapados</p>

            <p><strong>Title:</strong><br>
            Apriel-1.5-15b-Thinker</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01141v1">http://arxiv.org/abs/2510.01141v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shruthan Radhakrishna, Aman Tiwari, Aanjaneya Shukla, Masoud Hashemi, Rishabh Maheshwary, Shiva Krishna Reddy Malay, Jash Mehta, Pulkit Pattnaik, Saloni Mittal, Khalil Slimi, Kelechi Ogueji, Akintunde Oladipo, Soham Parikh, Oluwanifemi Bamgbose, Toby Liang, Ahmed Masry, Khyati Mahajan, Sai Rajeswar Mudumba, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sagar Davasam, Srinivas Sunkara, Nicholas Chapados</p>

            <p><strong>Title:</strong><br>
            Apriel-1.5-15b-Thinker</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01141v1">http://arxiv.org/abs/2510.01141v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Oct 2025 19:59:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1304cf5f/9efcba11.mp3" length="24458597" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1525</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 78 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shruthan Radhakrishna, Aman Tiwari, Aanjaneya Shukla, Masoud Hashemi, Rishabh Maheshwary, Shiva Krishna Reddy Malay, Jash Mehta, Pulkit Pattnaik, Saloni Mittal, Khalil Slimi, Kelechi Ogueji, Akintunde Oladipo, Soham Parikh, Oluwanifemi Bamgbose, Toby Liang, Ahmed Masry, Khyati Mahajan, Sai Rajeswar Mudumba, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Torsten Scholak, Sagar Davasam, Srinivas Sunkara, Nicholas Chapados</p>

            <p><strong>Title:</strong><br>
            Apriel-1.5-15b-Thinker</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01141v1">http://arxiv.org/abs/2510.01141v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Apriel-1.5-15B-Thinker, a 15-billion parameter open-weights multimodal reasoning model that achieves frontier-level performance through training design rather than sheer scale. Starting from Pixtral-12B, we apply a progressive three-stage methodology: (1) depth upscaling to expand reasoning capacity without pretraining from scratch, (2) staged continual pre-training that first develops foundational text and vision understanding, then enhances visual reasoning through targeted synthetic data generation addressing spatial structure, compositional understanding, and fine-grained perception, and (3) high-quality text-only supervised fine-tuning on curated instruction-response pairs with explicit reasoning traces spanning mathematics, coding, science, and tool use. Notably, our model achieves competitive results without reinforcement learning or preference optimization, isolating the contribution of our data-centric continual pre-training approach. On the Artificial Analysis Intelligence Index, Apriel-1.5-15B-Thinker attains a score of 52, matching DeepSeek-R1-0528 despite requiring significantly fewer computational resources. Across ten image benchmarks, its performance is on average within five points of Gemini-2.5-Flash and Claude Sonnet-3.7, a key achievement for a model operating within single-GPU deployment constraints. Our results demonstrate that thoughtful mid-training design can close substantial capability gaps without massive scale, making frontier-level multimodal reasoning accessible to organizations with limited infrastructure. We release the model checkpoint, all training recipes, and evaluation protocols under the MIT license to advance open-source research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Reasoning Models Learn Better Alignment from Flawed Thinking</title>
      <itunes:episode>1230</itunes:episode>
      <podcast:episode>1230</podcast:episode>
      <itunes:title>Large Reasoning Models Learn Better Alignment from Flawed Thinking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d69a727-78d8-4a48-8100-5e4a1369d317</guid>
      <link>https://share.transistor.fm/s/1bf1df61</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi</p>

            <p><strong>Title:</strong><br>
            Large Reasoning Models Learn Better Alignment from Flawed Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00938v1">http://arxiv.org/abs/2510.00938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi</p>

            <p><strong>Title:</strong><br>
            Large Reasoning Models Learn Better Alignment from Flawed Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00938v1">http://arxiv.org/abs/2510.00938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Oct 2025 19:59:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1bf1df61/50796d23.mp3" length="21217785" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1322</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi</p>

            <p><strong>Title:</strong><br>
            Large Reasoning Models Learn Better Alignment from Flawed Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00938v1">http://arxiv.org/abs/2510.00938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Multi-modal Large Language Models via Progressive Consistency Distillation</title>
      <itunes:episode>1229</itunes:episode>
      <podcast:episode>1229</podcast:episode>
      <itunes:title>Efficient Multi-modal Large Language Models via Progressive Consistency Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1e61fdff-42d3-4da3-aa0c-2382debb3c32</guid>
      <link>https://share.transistor.fm/s/1da26b4e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Multi-modal Large Language Models via Progressive Consistency Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00515v1">http://arxiv.org/abs/2510.00515v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Multi-modal Large Language Models via Progressive Consistency Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00515v1">http://arxiv.org/abs/2510.00515v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Oct 2025 19:59:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1da26b4e/9ca2dd75.mp3" length="20885943" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Efficient Multi-modal Large Language Models via Progressive Consistency Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00515v1">http://arxiv.org/abs/2510.00515v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongCodeZip: Compress Long Context for Code Language Models</title>
      <itunes:episode>1228</itunes:episode>
      <podcast:episode>1228</podcast:episode>
      <itunes:title>LongCodeZip: Compress Long Context for Code Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">39c40907-fde4-4efb-839e-326179566e61</guid>
      <link>https://share.transistor.fm/s/d1499fa6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            LongCodeZip: Compress Long Context for Code Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00446v1">http://arxiv.org/abs/2510.00446v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.</p>
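
            <p><strong>Illustrative code sketch:</strong><br>
            A hedged Python sketch of the coarse-grained stage described above: function-level chunks are ranked by the conditional perplexity a small causal LM assigns to the instruction given each chunk, and the most relevant chunks are kept under a token budget. The scoring model (gpt2), the chunking, and the budget are placeholders; the paper's fine-grained block selection is not reproduced.</p>

            <pre><code>import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in scoring model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def conditional_ppl(chunk: str, instruction: str) -> float:
    """Perplexity of the instruction tokens given the chunk as prefix."""
    prefix = tok(chunk, return_tensors="pt").input_ids
    target = tok(instruction, return_tensors="pt").input_ids
    ids = torch.cat([prefix, target], dim=1)
    labels = ids.clone()
    labels[:, : prefix.shape[1]] = -100                # score only the instruction
    loss = lm(ids, labels=labels).loss
    return math.exp(loss.item())


def compress(chunks: list[str], instruction: str, budget: int) -> list[str]:
    """Keep the lowest-perplexity (most relevant) chunks within the budget."""
    ranked = sorted(chunks, key=lambda c: conditional_ppl(c, instruction))
    kept, used = [], 0
    for c in ranked:
        n = len(tok(c).input_ids)
        if used + n <= budget:
            kept.append(c)
            used += n
    return kept


functions = ["def add(a, b):\n    return a + b", "def log(msg):\n    print(msg)"]
print(compress(functions, "Write a test for the addition function.", budget=64))
</code></pre>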
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            LongCodeZip: Compress Long Context for Code Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00446v1">http://arxiv.org/abs/2510.00446v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:42:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1499fa6/07cc873d.mp3" length="28395391" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1771</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu</p>

            <p><strong>Title:</strong><br>
            LongCodeZip: Compress Long Context for Code Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00446v1">http://arxiv.org/abs/2510.00446v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Self-Forcing++: Towards Minute-Scale High-Quality Video Generation</title>
      <itunes:episode>1227</itunes:episode>
      <podcast:episode>1227</podcast:episode>
      <itunes:title>Self-Forcing++: Towards Minute-Scale High-Quality Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b013487-98d7-421f-8016-ecd0c90663a6</guid>
      <link>https://share.transistor.fm/s/675f5996</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh</p>

            <p><strong>Title:</strong><br>
            Self-Forcing++: Towards Minute-Scale High-Quality Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02283v1">http://arxiv.org/abs/2510.02283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation, without recomputing overlapping frames as previous methods do. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-plus-plus.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh</p>

            <p><strong>Title:</strong><br>
            Self-Forcing++: Towards Minute-Scale High-Quality Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02283v1">http://arxiv.org/abs/2510.02283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation, without recomputing overlapping frames as previous methods do. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-plus-plus.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:42:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/675f5996/792f2eec.mp3" length="22185361" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1383</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh</p>

            <p><strong>Title:</strong><br>
            Self-Forcing++: Towards Minute-Scale High-Quality Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02283v1">http://arxiv.org/abs/2510.02283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond the teacher's capability, avoiding common issues such as over-exposure and error accumulation, without recomputing overlapping frames as previous methods do. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model's position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon video demos can be found at https://self-forcing-plus-plus.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ExGRPO: Learning to Reason from Experience</title>
      <itunes:episode>1226</itunes:episode>
      <podcast:episode>1226</podcast:episode>
      <itunes:title>ExGRPO: Learning to Reason from Experience</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8227ad7a-470f-4e88-ab27-50fe3987db36</guid>
      <link>https://share.transistor.fm/s/52d16e93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ExGRPO: Learning to Reason from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02245v1">http://arxiv.org/abs/2510.02245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ExGRPO: Learning to Reason from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02245v1">http://arxiv.org/abs/2510.02245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:42:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52d16e93/d28431b5.mp3" length="21129154" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            ExGRPO: Learning to Reason from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02245v1">http://arxiv.org/abs/2510.02245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions</title>
      <itunes:episode>1225</itunes:episode>
      <podcast:episode>1225</podcast:episode>
      <itunes:title>StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f85265c5-8352-4fcf-8b13-be4a7b9a84de</guid>
      <link>https://share.transistor.fm/s/ec74c0b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu</p>

            <p><strong>Title:</strong><br>
            StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02314v1">http://arxiv.org/abs/2510.02314v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu</p>

            <p><strong>Title:</strong><br>
            StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02314v1">http://arxiv.org/abs/2510.02314v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:41:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ec74c0b3/9077f9c4.mp3" length="25130735" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1567</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bo-Hsu Ke, You-Zhe Xie, Yu-Lun Liu, Wei-Chen Chiu</p>

            <p><strong>Title:</strong><br>
            StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02314v1">http://arxiv.org/abs/2510.02314v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques. Project page: https://hentci.github.io/stealthattack/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Interactive Training: Feedback-Driven Neural Network Optimization</title>
      <itunes:episode>1224</itunes:episode>
      <podcast:episode>1224</podcast:episode>
      <itunes:title>Interactive Training: Feedback-Driven Neural Network Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8622a3b7-be0d-4ae9-b883-a6a624b75d80</guid>
      <link>https://share.transistor.fm/s/335dcc77</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wentao Zhang, Yang Young Lu, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            Interactive Training: Feedback-Driven Neural Network Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02297v1">http://arxiv.org/abs/2510.02297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wentao Zhang, Yang Young Lu, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            Interactive Training: Feedback-Driven Neural Network Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02297v1">http://arxiv.org/abs/2510.02297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:41:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/335dcc77/e75f3da9.mp3" length="20137777" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wentao Zhang, Yang Young Lu, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            Interactive Training: Feedback-Driven Neural Network Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02297v1">http://arxiv.org/abs/2510.02297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ModernVBERT: Towards Smaller Visual Document Retrievers</title>
      <itunes:episode>1223</itunes:episode>
      <podcast:episode>1223</podcast:episode>
      <itunes:title>ModernVBERT: Towards Smaller Visual Document Retrievers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b93d734e-353b-46aa-a988-7e0fea549383</guid>
      <link>https://share.transistor.fm/s/11ee867d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse</p>

            <p><strong>Title:</strong><br>
            ModernVBERT: Towards Smaller Visual Document Retrievers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01149v1">http://arxiv.org/abs/2510.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. In particular, we measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse</p>

            <p><strong>Title:</strong><br>
            ModernVBERT: Towards Smaller Visual Document Retrievers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01149v1">http://arxiv.org/abs/2510.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. In particular, we measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:41:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/11ee867d/0a2be5fe.mp3" length="22474159" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1401</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse</p>

            <p><strong>Title:</strong><br>
            ModernVBERT: Towards Smaller Visual Document Retrievers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01149v1">http://arxiv.org/abs/2510.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. In particular, we measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?</title>
      <itunes:episode>1222</itunes:episode>
      <podcast:episode>1222</podcast:episode>
      <itunes:title>StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bc6444a2-9734-4ba9-8fc0-4042411f9039</guid>
      <link>https://share.transistor.fm/s/8493619d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02209v1">http://arxiv.org/abs/2510.02209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02209v1">http://arxiv.org/abs/2510.02209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Oct 2025 20:40:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8493619d/3d98f2ee.mp3" length="29647611" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1849</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.02209v1">http://arxiv.org/abs/2510.02209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search</title>
      <itunes:episode>1221</itunes:episode>
      <podcast:episode>1221</podcast:episode>
      <itunes:title>DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5fd0a6b1-5d6e-4b21-83e4-ad2799610d3e</guid>
      <link>https://share.transistor.fm/s/5322244c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25454v2">http://arxiv.org/abs/2509.25454v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25454v2">http://arxiv.org/abs/2509.25454v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:44:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5322244c/dcb9a207.mp3" length="23290074" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25454v2">http://arxiv.org/abs/2509.25454v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GEM: A Gym for Agentic LLMs</title>
      <itunes:episode>1220</itunes:episode>
      <podcast:episode>1220</podcast:episode>
      <itunes:title>GEM: A Gym for Agentic LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c3b9457-dbab-4d5c-a555-ded19d678652</guid>
      <link>https://share.transistor.fm/s/5757449d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            GEM: A Gym for Agentic LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01051v1">http://arxiv.org/abs/2510.01051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating the use of GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit in addition to a training environment. We hope this framework can help accelerate future agentic LLM research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            GEM: A Gym for Agentic LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01051v1">http://arxiv.org/abs/2510.01051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating the use of GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which -- unlike GRPO -- is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit in addition to a training environment. We hope this framework can help accelerate future agentic LLM research.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:44:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5757449d/75408ddb.mp3" length="24913759" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1553</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, Chuen Yang Beh, Weixun Wang, Hao Zhu, Weiyan Shi, Diyi Yang, Michael Shieh, Yee Whye Teh, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            GEM: A Gym for Agentic LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.01051v1">http://arxiv.org/abs/2510.01051v1</a></p>

            <p><strong>Abstract:</strong><br>
            The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills by interacting with complex environments. To facilitate this transition, we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating how to use GEM with five popular RL training frameworks. Along with this, we provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which, unlike GRPO, is compatible with the full RL setting of dense per-turn rewards and offers better credit assignment. We further conduct apples-to-apples benchmarking of PPO, GRPO, and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. Lastly, GEM also functions as a convenient evaluation toolkit in addition to a training environment. We hope this framework can help accelerate future agentic LLM research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators</title>
      <itunes:episode>1219</itunes:episode>
      <podcast:episode>1219</podcast:episode>
      <itunes:title>VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">802eee23-701c-4c81-8938-db1e20196ad3</guid>
      <link>https://share.transistor.fm/s/f33bda91</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su</p>

            <p><strong>Title:</strong><br>
            VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00406v1">http://arxiv.org/abs/2510.00406v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su</p>

            <p><strong>Title:</strong><br>
            VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00406v1">http://arxiv.org/abs/2510.00406v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:43:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f33bda91/9f0da3c2.mp3" length="25815787" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1610</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, Weihua Su</p>

            <p><strong>Title:</strong><br>
            VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00406v1">http://arxiv.org/abs/2510.00406v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models enable embodied decision-making but rely heavily on imitation learning, leading to compounding errors and poor robustness under distribution shift. Reinforcement learning (RL) can mitigate these issues yet typically demands costly real-world interactions or suffers from sim-to-real gaps. We introduce VLA-RFT, a reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator. Trained from real interaction data, the simulator predicts future visual observations conditioned on actions, allowing policy rollouts with dense, trajectory-level rewards derived from goal-achieving references. This design delivers an efficient and action-aligned learning signal, drastically lowering sample requirements. With fewer than 400 fine-tuning steps, VLA-RFT surpasses strong supervised baselines and achieves greater efficiency than simulator-based RL. Moreover, it exhibits strong robustness under perturbed conditions, sustaining stable task execution. Our results establish world-model-based RFT as a practical post-training paradigm to enhance the generalization and robustness of VLA models. For more details, please refer to https://vla-rft.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation</title>
      <itunes:episode>1218</itunes:episode>
      <podcast:episode>1218</podcast:episode>
      <itunes:title>Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">008e895c-cf97-4090-bda5-3900e3368bec</guid>
      <link>https://share.transistor.fm/s/7e9e37f5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25849v1">http://arxiv.org/abs/2509.25849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25849v1">http://arxiv.org/abs/2509.25849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:43:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e9e37f5/c0d71ce7.mp3" length="23906943" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1490</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25849v1">http://arxiv.org/abs/2509.25849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PIPer: On-Device Environment Setup via Online Reinforcement Learning</title>
      <itunes:episode>1217</itunes:episode>
      <podcast:episode>1217</podcast:episode>
      <itunes:title>PIPer: On-Device Environment Setup via Online Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dfe60bad-b3aa-42a5-9621-35c1d5eded47</guid>
      <link>https://share.transistor.fm/s/aac61e41</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov</p>

            <p><strong>Title:</strong><br>
            PIPer: On-Device Environment Setup via Online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25455v1">http://arxiv.org/abs/2509.25455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Environment setup, the process of configuring the system to work with a specific software project, represents a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. This also helps SE researchers to scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts and Reinforcement Learning with Verifiable Rewards (RLVR) to adapt it to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models, Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov</p>

            <p><strong>Title:</strong><br>
            PIPer: On-Device Environment Setup via Online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25455v1">http://arxiv.org/abs/2509.25455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Environment setup, the process of configuring the system to work with a specific software project, represents a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. This also helps SE researchers to scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts and Reinforcement Learning with Verifiable Rewards (RLVR) to adapt it to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models, Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:43:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aac61e41/0b15bba2.mp3" length="19560579" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1219</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov</p>

            <p><strong>Title:</strong><br>
            PIPer: On-Device Environment Setup via Online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25455v1">http://arxiv.org/abs/2509.25455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Environment setup, the process of configuring the system to work with a specific software project, represents a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. This also helps SE researchers to scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts and Reinforcement Learning with Verifiable Rewards (RLVR) to adapt it to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models, Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights</title>
      <itunes:episode>1216</itunes:episode>
      <podcast:episode>1216</podcast:episode>
      <itunes:title>SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">92b8f0e1-206e-4fdd-a00a-898f6598a9ee</guid>
      <link>https://share.transistor.fm/s/43fd7e9c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli</p>

            <p><strong>Title:</strong><br>
            SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22944v2">http://arxiv.org/abs/2509.22944v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli</p>

            <p><strong>Title:</strong><br>
            SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22944v2">http://arxiv.org/abs/2509.22944v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:42:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/43fd7e9c/bee9d6d0.mp3" length="23185974" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli</p>

            <p><strong>Title:</strong><br>
            SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22944v2">http://arxiv.org/abs/2509.22944v2</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ACON: Optimizing Context Compression for Long-horizon LLM Agents</title>
      <itunes:episode>1215</itunes:episode>
      <podcast:episode>1215</podcast:episode>
      <itunes:title>ACON: Optimizing Context Compression for Long-horizon LLM Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cd256010-9ea2-4bd1-99e0-4847e8b075c6</guid>
      <link>https://share.transistor.fm/s/1dcc0f60</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan</p>

            <p><strong>Title:</strong><br>
            ACON: Optimizing Context Compression for Long-horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00615v1">http://arxiv.org/abs/2510.00615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan</p>

            <p><strong>Title:</strong><br>
            ACON: Optimizing Context Compression for Long-horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00615v1">http://arxiv.org/abs/2510.00615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Oct 2025 20:42:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1dcc0f60/09561273.mp3" length="24316532" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1516</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, Saravan Rajmohan</p>

            <p><strong>Title:</strong><br>
            ACON: Optimizing Context Compression for Long-horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2510.00615v1">http://arxiv.org/abs/2510.00615v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly deployed as agents in dynamic, real-world environments, where success requires both reasoning and effective tool use. A central challenge for agentic tasks is the growing context length, as agents must accumulate long histories of actions and observations. This expansion raises costs and reduces efficiency in long-horizon tasks, yet prior work on context compression has mostly focused on single-step tasks or narrow applications. We introduce Agent Context Optimization (ACON), a unified framework that optimally compresses both environment observations and interaction histories into concise yet informative condensations. ACON leverages compression guideline optimization in natural language space: given paired trajectories where full context succeeds but compressed context fails, capable LLMs analyze the causes of failure, and the compression guideline is updated accordingly. Furthermore, we propose distilling the optimized LLM compressor into smaller models to reduce the overhead of the additional module. Experiments on AppWorld, OfficeBench, and Multi-objective QA show that ACON reduces memory usage by 26-54% (peak tokens) while largely preserving task performance, preserves over 95% of accuracy when distilled into smaller compressors, and enhances smaller LMs as long-horizon agents with up to 46% performance improvement.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use</title>
      <itunes:episode>1214</itunes:episode>
      <podcast:episode>1214</podcast:episode>
      <itunes:title>MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d33b3260-609b-47ca-a997-ad6778e34fd6</guid>
      <link>https://share.transistor.fm/s/1ef3fe05</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24002v1">http://arxiv.org/abs/2509.24002v1</a></p>

            <p><strong>Abstract:</strong><br>
            MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24002v1">http://arxiv.org/abs/2509.24002v1</a></p>

            <p><strong>Abstract:</strong><br>
            MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:32:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1ef3fe05/5088c0f7.mp3" length="24309020" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1516</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, Zirui Wang, Jinjie Ni, Yufan Yang, Arvin Xu, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24002v1">http://arxiv.org/abs/2509.24002v1</a></p>

            <p><strong>Abstract:</strong><br>
            MCP standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only 52.56% pass@1 and 33.86% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below 30% pass@1 and 15% pass^4. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain</title>
      <itunes:episode>1213</itunes:episode>
      <podcast:episode>1213</podcast:episode>
      <itunes:title>The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0108575a-b41e-409d-9673-7dfea4a7cb2e</guid>
      <link>https://share.transistor.fm/s/0a116a93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.NE, cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz</p>

            <p><strong>Title:</strong><br>
            The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26507v1">http://arxiv.org/abs/2509.26507v1</a></p>

            <p><strong>Abstract:</strong><br>
            The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.NE, cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz</p>

            <p><strong>Title:</strong><br>
            The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26507v1">http://arxiv.org/abs/2509.26507v1</a></p>

            <p><strong>Abstract:</strong><br>
            The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:32:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0a116a93/1e4c0bf1.mp3" length="22778883" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 106 | cs.NE, cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz</p>

            <p><strong>Title:</strong><br>
            The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26507v1">http://arxiv.org/abs/2509.26507v1</a></p>

            <p><strong>Abstract:</strong><br>
            The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play</title>
      <itunes:episode>1212</itunes:episode>
      <podcast:episode>1212</podcast:episode>
      <itunes:title>Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f4cdf4c-8a7b-4f4b-94e5-28d3166438d2</guid>
      <link>https://share.transistor.fm/s/a44946dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao</p>

            <p><strong>Title:</strong><br>
            Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25541v1">http://arxiv.org/abs/2509.25541v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao</p>

            <p><strong>Title:</strong><br>
            Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25541v1">http://arxiv.org/abs/2509.25541v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:31:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a44946dc/538f9c89.mp3" length="28048501" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1749</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 103 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao</p>

            <p><strong>Title:</strong><br>
            Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25541v1">http://arxiv.org/abs/2509.25541v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning</title>
      <itunes:episode>1211</itunes:episode>
      <podcast:episode>1211</podcast:episode>
      <itunes:title>Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2f9537e4-a3d8-4bec-b575-cccf632dab6e</guid>
      <link>https://share.transistor.fm/s/4890944f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23873v1">http://arxiv.org/abs/2509.23873v1</a></p>

            <p><strong>Abstract:</strong><br>
            As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies--high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23873v1">http://arxiv.org/abs/2509.23873v1</a></p>

            <p><strong>Abstract:</strong><br>
            As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies--high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:31:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4890944f/a072b7c6.mp3" length="19321137" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1204</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23873v1">http://arxiv.org/abs/2509.23873v1</a></p>

            <p><strong>Abstract:</strong><br>
            As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies--high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning</title>
      <itunes:episode>1210</itunes:episode>
      <podcast:episode>1210</podcast:episode>
      <itunes:title>TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2303b930-f86f-4642-b4ae-9ff883f76291</guid>
      <link>https://share.transistor.fm/s/1679b76c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25760v1">http://arxiv.org/abs/2509.25760v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25760v1">http://arxiv.org/abs/2509.25760v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:31:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1679b76c/bdbe546d.mp3" length="23784469" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1483</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25760v1">http://arxiv.org/abs/2509.25760v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups. An in-depth ablation study demonstrates that vanilla accuracy-driven methods, such as supervised fine-tuning or RL with a binary reward, struggle to balance factual correctness and uncertainty. In contrast, our proposed truthfulness-driven TruthRL achieves strong performance in both accuracy and truthfulness, underscoring the importance of learning objective design for developing truthful LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training</title>
      <itunes:episode>1209</itunes:episode>
      <podcast:episode>1209</podcast:episode>
      <itunes:title>Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">47a78f76-f4fc-4015-8d1b-a81afdf733e2</guid>
      <link>https://share.transistor.fm/s/aa8e79f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos</p>

            <p><strong>Title:</strong><br>
            Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26625v1">http://arxiv.org/abs/2509.26625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos</p>

            <p><strong>Title:</strong><br>
            Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26625v1">http://arxiv.org/abs/2509.26625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:30:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa8e79f1/778e19ca.mp3" length="26382111" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1645</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Junlin Han, Shengbang Tong, David Fan, Yufan Ren, Koustuv Sinha, Philip Torr, Filippos Kokkinos</p>

            <p><strong>Title:</strong><br>
            Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26625v1">http://arxiv.org/abs/2509.26625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and in some cases, to perform visual tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, a perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline, from LLM pre-training to visual alignment and supervised multimodal fine-tuning, across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we propose and investigate several hypotheses, and introduce the Multi-Level Existence Bench (MLE-Bench). Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OceanGym: A Benchmark Environment for Underwater Embodied Agents</title>
      <itunes:episode>1208</itunes:episode>
      <podcast:episode>1208</podcast:episode>
      <itunes:title>OceanGym: A Benchmark Environment for Underwater Embodied Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4ccc3430-31e4-4b88-9c33-4e0c05bc94f5</guid>
      <link>https://share.transistor.fm/s/0c526ff1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OceanGym: A Benchmark Environment for Underwater Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26536v1">http://arxiv.org/abs/2509.26536v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OceanGym: A Benchmark Environment for Underwater Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26536v1">http://arxiv.org/abs/2509.26536v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:30:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c526ff1/20878404.mp3" length="21535432" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.AI, cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OceanGym: A Benchmark Environment for Underwater Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26536v1">http://arxiv.org/abs/2509.26536v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models</title>
      <itunes:episode>1207</itunes:episode>
      <podcast:episode>1207</podcast:episode>
      <itunes:title>More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6bde6c3b-7938-4b76-aa5d-3e2171d96636</guid>
      <link>https://share.transistor.fm/s/8112338a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang</p>

            <p><strong>Title:</strong><br>
            More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25848v1">http://arxiv.org/abs/2509.25848v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang</p>

            <p><strong>Title:</strong><br>
            More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25848v1">http://arxiv.org/abs/2509.25848v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:30:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8112338a/17f6dddf.mp3" length="23653253" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1475</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang</p>

            <p><strong>Title:</strong><br>
            More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25848v1">http://arxiv.org/abs/2509.25848v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners</title>
      <itunes:episode>1206</itunes:episode>
      <podcast:episode>1206</podcast:episode>
      <itunes:title>Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3158399a-4e9b-4ff8-8475-d84f15157e2a</guid>
      <link>https://share.transistor.fm/s/c8fe01a2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Xu, Cliveb AI, Kai Yang, Tianhao Chen, Yang Wang, Saiyong Yang, Can Yang</p>

            <p><strong>Title:</strong><br>
            Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26226v1">http://arxiv.org/abs/2509.26226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation, explicitly discarding the thinking content via a direct ** append, to reduce token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Xu, Cliveb AI, Kai Yang, Tianhao Chen, Yang Wang, Saiyong Yang, Can Yang</p>

            <p><strong>Title:</strong><br>
            Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26226v1">http://arxiv.org/abs/2509.26226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation, explicitly discarding the thinking content via a direct ** append, to reduce token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:29:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c8fe01a2/407da8fd.mp3" length="22448298" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Xu, Cliveb AI, Kai Yang, Tianhao Chen, Yang Wang, Saiyong Yang, Can Yang</p>

            <p><strong>Title:</strong><br>
            Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.26226v1">http://arxiv.org/abs/2509.26226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce Thinking-Free Policy Initialization (TFPI), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple ThinkFree operation, explicitly discarding the thinking content via a direct ** append, to reduce token usage during inference. Training with ThinkFree-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder</title>
      <itunes:episode>1205</itunes:episode>
      <podcast:episode>1205</podcast:episode>
      <itunes:title>DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f94eb852-696f-4b53-b155-016fcb463335</guid>
      <link>https://share.transistor.fm/s/c57a837c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai</p>

            <p><strong>Title:</strong><br>
            DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25182v1">http://arxiv.org/abs/2509.25182v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai</p>

            <p><strong>Title:</strong><br>
            DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25182v1">http://arxiv.org/abs/2509.25182v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Oct 2025 21:29:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c57a837c/4f787d1c.mp3" length="24586131" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1533</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai</p>

            <p><strong>Title:</strong><br>
            DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25182v1">http://arxiv.org/abs/2509.25182v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention</title>
      <itunes:episode>1204</itunes:episode>
      <podcast:episode>1204</podcast:episode>
      <itunes:title>SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6115381c-e7b2-4796-9160-6b14707321dc</guid>
      <link>https://share.transistor.fm/s/34fa80bd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24006v1">http://arxiv.org/abs/2509.24006v1</a></p>

            <p><strong>Abstract:</strong><br>
            In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24006v1">http://arxiv.org/abs/2509.24006v1</a></p>

            <p><strong>Abstract:</strong><br>
            In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:10:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/34fa80bd/b064b498.mp3" length="23653254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1475</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24006v1">http://arxiv.org/abs/2509.24006v1</a></p>

            <p><strong>Abstract:</strong><br>
            In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs</title>
      <itunes:episode>1203</itunes:episode>
      <podcast:episode>1203</podcast:episode>
      <itunes:title>StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6cc5fb57-3735-4dbe-86ef-81dcc74c9447</guid>
      <link>https://share.transistor.fm/s/b8c99757</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou</p>

            <p><strong>Title:</strong><br>
            StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22220v1">http://arxiv.org/abs/2509.22220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.</p>
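
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract describes merging parallel quantizer branches by bit-wise voting. The toy below assumes each branch emits an 8-bit code per frame (the quantizers themselves are not shown) and takes a per-bit majority across branches, so occasional bit flips in individual branches do not change the final token id.</p>

            <pre><code>import numpy as np

def bitwise_vote(branch_bits):
    # branch_bits: (num_branches, num_frames, num_bits) array of 0/1 codes.
    counts = branch_bits.sum(axis=0)                     # per-bit vote counts
    majority = (counts * 2 > branch_bits.shape[0]).astype(int)
    weights = 2 ** np.arange(branch_bits.shape[-1])      # pack bits into an id
    return (majority * weights).sum(axis=-1)

rng = np.random.default_rng(0)
clean = rng.integers(0, 2, size=(1, 6, 8))               # "true" 8-bit codes
branches = np.repeat(clean, 5, axis=0)                   # 5 parallel branches
flips = np.less(rng.random(branches.shape), 0.1)         # 10% random bit flips
branches = np.where(flips, 1 - branches, branches)
print(bitwise_vote(branches))                            # voted, stable token ids
print(bitwise_vote(clean))                               # codes without any noise
</code></pre>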
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou</p>

            <p><strong>Title:</strong><br>
            StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22220v1">http://arxiv.org/abs/2509.22220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:10:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8c99757/1d7799b9.mp3" length="20215531" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou</p>

            <p><strong>Title:</strong><br>
            StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22220v1">http://arxiv.org/abs/2509.22220v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multiplayer Nash Preference Optimization</title>
      <itunes:episode>1202</itunes:episode>
      <podcast:episode>1202</podcast:episode>
      <itunes:title>Multiplayer Nash Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b932d3bf-2d42-4c82-8e62-cbd1957bd9d5</guid>
      <link>https://share.transistor.fm/s/bc380a5b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            Multiplayer Nash Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23102v1">http://arxiv.org/abs/2509.23102v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.</p>
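
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            To make the n-player idea concrete at a toy scale, the snippet below lets several categorical "policies" play a non-transitive preference game against the population average, each taking a damped, KL-regularized best response toward a uniform reference. The closed-form response pi(a) proportional to ref(a)*exp(win(a)/beta) and the damping schedule are simplifying assumptions, not the paper's MNPO update for LLMs.</p>

            <pre><code>import numpy as np

def toy_multiplayer_game(P, n_players=4, beta=0.5, lr=0.2, steps=300, seed=0):
    # P[a, b] = probability that action a is preferred over action b.
    n = P.shape[0]
    ref = np.full(n, 1.0 / n)                      # reference policy
    pis = np.random.default_rng(seed).dirichlet(np.ones(n), size=n_players)
    for _ in range(steps):
        mix = pis.mean(axis=0)                     # opponent population
        win = P @ mix                              # per-action win rate vs. the mix
        br = ref * np.exp(win / beta)              # KL-regularized best response
        br /= br.sum()
        pis = (1 - lr) * pis + lr * br             # damped update for every player
    return pis

P = np.array([[0.5, 0.7, 0.2],
              [0.3, 0.5, 0.8],
              [0.8, 0.2, 0.5]])                    # rock-paper-scissors-like cycle
print(toy_multiplayer_game(P).round(3))
</code></pre>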
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            Multiplayer Nash Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23102v1">http://arxiv.org/abs/2509.23102v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:09:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bc380a5b/70b13556.mp3" length="25228495" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1573</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            Multiplayer Nash Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23102v1">http://arxiv.org/abs/2509.23102v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark</title>
      <itunes:episode>1201</itunes:episode>
      <podcast:episode>1201</podcast:episode>
      <itunes:title>RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0e31887b-0340-4734-9e78-da76a55918ac</guid>
      <link>https://share.transistor.fm/s/cd974db3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24897v1">http://arxiv.org/abs/2509.24897v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24897v1">http://arxiv.org/abs/2509.24897v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:09:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cd974db3/e7e6067f.mp3" length="24118860" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Shi, Yuhao Dong, Yue Ding, Yuran Wang, Xuanyu Zhu, Sheng Zhou, Wenting Liu, Haochen Tian, Rundong Wang, Huanqian Wang, Zuyan Liu, Bohan Zeng, Ruizhe Chen, Qixun Wang, Zhuoran Zhang, Xinlong Chen, Chengzhuo Tong, Bozhou Li, Chaoyou Fu, Qiang Liu, Haotian Wang, Wenjing Yang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24897v1">http://arxiv.org/abs/2509.24897v1</a></p>

            <p><strong>Abstract:</strong><br>
            The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergetic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to precisely discern whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR</title>
      <itunes:episode>1200</itunes:episode>
      <podcast:episode>1200</podcast:episode>
      <itunes:title>Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76ab1d90-5749-4c73-87c9-0afedd412bb7</guid>
      <link>https://share.transistor.fm/s/93f7033e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang</p>

            <p><strong>Title:</strong><br>
            Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23808v1">http://arxiv.org/abs/2509.23808v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.</p>
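
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Effective Rank has a standard definition (the exponential of the entropy of the normalized singular values of the hidden-state matrix), computed below; ERV and ERA are approximated here as first and second finite differences of ER over logged training steps, which is one reading of "first- and second-order derivatives" and may differ from the paper's estimator.</p>

            <pre><code>import numpy as np

def effective_rank(H):
    # H: (num_tokens, hidden_dim) hidden-state matrix.
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()                               # normalized singular values
    entropy = -(p * np.log(p + 1e-12)).sum()
    return float(np.exp(entropy))                 # Roy-Vetterli effective rank

rng = np.random.default_rng(0)
ers = []
for t in range(5):                                # pretend: 5 logged training steps
    H = rng.standard_normal((64, 32))
    H[:, 8 + 4 * t:] *= 0.05                      # later steps use more directions
    ers.append(effective_rank(H))

ers = np.array(ers)
erv = np.diff(ers)                                # "velocity": 1st finite difference
era = np.diff(erv)                                # "acceleration": 2nd finite difference
print(ers.round(2), erv.round(2), era.round(2))
</code></pre>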
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang</p>

            <p><strong>Title:</strong><br>
            Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23808v1">http://arxiv.org/abs/2509.23808v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:08:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/93f7033e/7e4fbddf.mp3" length="23721808" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1479</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, Zhi Wang</p>

            <p><strong>Title:</strong><br>
            Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23808v1">http://arxiv.org/abs/2509.23808v1</a></p>

            <p><strong>Abstract:</strong><br>
            A prevailing view in Reinforcement Learning for Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective, proposing that this perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its novel first- and second-order derivatives, named Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis reveals that at the hidden-state level, exploration and exploitation could be decoupled (Sec. 4). This finding reveals an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is leveraging the theoretically stable ERA as a predictive meta-controller to create a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including up to 21.4% absolute accuracy improvement on the challenging Gaokao 2024 dataset.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing</title>
      <itunes:episode>1199</itunes:episode>
      <podcast:episode>1199</podcast:episode>
      <itunes:title>OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">201445eb-97e6-4605-bb05-b01438248085</guid>
      <link>https://share.transistor.fm/s/9129f6e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24900v1">http://arxiv.org/abs/2509.24900v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24900v1">http://arxiv.org/abs/2509.24900v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:08:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9129f6e6/56fee408.mp3" length="18864274" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1175</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang</p>

            <p><strong>Title:</strong><br>
            OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24900v1">http://arxiv.org/abs/2509.24900v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines a hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer</title>
      <itunes:episode>1198</itunes:episode>
      <podcast:episode>1198</podcast:episode>
      <itunes:title>SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b7cabb45-deea-48a7-ade0-4dae4984de5d</guid>
      <link>https://share.transistor.fm/s/9556ccb3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24695v1">http://arxiv.org/abs/2509.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.</p>
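
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The constant-memory idea rests on the cumulative form of linear attention: keep running sums S = sum(phi(k) v^T) and z = sum(phi(k)) and answer each new block's queries from them. The sketch below is block-causal (each block sees itself plus all earlier blocks) and uses a simple ReLU-style feature map; the actual Linear DiT and KV-cache design differ in detail.</p>

            <pre><code>import numpy as np

def phi(x):
    # Simple positive feature map; linear-attention variants use various choices.
    return np.maximum(x, 0.0) + 1e-6

def blockwise_linear_attention(q_blocks, k_blocks, v_blocks):
    d_k, d_v = k_blocks[0].shape[-1], v_blocks[0].shape[-1]
    S = np.zeros((d_k, d_v))                   # running sum of phi(k) v^T
    z = np.zeros(d_k)                          # running sum of phi(k)
    outs = []
    for Qb, Kb, Vb in zip(q_blocks, k_blocks, v_blocks):
        S += phi(Kb).T @ Vb                    # fold the new block into the state
        z += phi(Kb).sum(axis=0)
        Qf = phi(Qb)
        outs.append((Qf @ S) / (Qf @ z)[:, None])   # global context, fixed memory
    return np.concatenate(outs, axis=0)

rng = np.random.default_rng(0)
qs, ks, vs = ([rng.standard_normal((4, 8)) for _ in range(3)] for _ in range(3))
print(blockwise_linear_attention(qs, ks, vs).shape)   # (12, 8): 3 blocks x 4 tokens
</code></pre>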
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24695v1">http://arxiv.org/abs/2509.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:08:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9556ccb3/61e96a37.mp3" length="25382760" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1583</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie</p>

            <p><strong>Title:</strong><br>
            SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.24695v1">http://arxiv.org/abs/2509.24695v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Democratizing AI scientists using ToolUniverse</title>
      <itunes:episode>1197</itunes:episode>
      <podcast:episode>1197</podcast:episode>
      <itunes:title>Democratizing AI scientists using ToolUniverse</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e277e8f1-a69c-488f-ad2b-24960c7d520b</guid>
      <link>https://share.transistor.fm/s/7f3960d3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, Marinka Zitnik</p>

            <p><strong>Title:</strong><br>
            Democratizing AI scientists using ToolUniverse</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23426v1">http://arxiv.org/abs/2509.23426v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI scientists are emerging computational systems that serve as collaborative partners in discovery. These systems remain difficult to build because they are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem. In omics, unified ecosystems have transformed research by enabling interoperability, reuse, and community-driven development; AI scientists require comparable infrastructure. We present ToolUniverse, an ecosystem for building AI scientists from any language or reasoning model, whether open or closed. ToolUniverse standardizes how AI scientists identify and call tools, integrating more than 600 machine learning models, datasets, APIs, and scientific packages for data analysis, knowledge retrieval, and experimental design. It automatically refines tool interfaces for correct use by AI scientists, creates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows. In a case study of hypercholesterolemia, ToolUniverse was used to create an AI scientist to identify a potent analog of a drug with favorable predicted properties. The open-source ToolUniverse is available at https://aiscientist.tools.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, Marinka Zitnik</p>

            <p><strong>Title:</strong><br>
            Democratizing AI scientists using ToolUniverse</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23426v1">http://arxiv.org/abs/2509.23426v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI scientists are emerging computational systems that serve as collaborative partners in discovery. These systems remain difficult to build because they are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem. In omics, unified ecosystems have transformed research by enabling interoperability, reuse, and community-driven development; AI scientists require comparable infrastructure. We present ToolUniverse, an ecosystem for building AI scientists from any language or reasoning model, whether open or closed. ToolUniverse standardizes how AI scientists identify and call tools, integrating more than 600 machine learning models, datasets, APIs, and scientific packages for data analysis, knowledge retrieval, and experimental design. It automatically refines tool interfaces for correct use by AI scientists, creates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows. In a case study of hypercholesterolemia, ToolUniverse was used to create an AI scientist to identify a potent analog of a drug with favorable predicted properties. The open-source ToolUniverse is available at https://aiscientist.tools.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:07:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f3960d3/6fee9906.mp3" length="25101860" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1565</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, Marinka Zitnik</p>

            <p><strong>Title:</strong><br>
            Democratizing AI scientists using ToolUniverse</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.23426v1">http://arxiv.org/abs/2509.23426v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI scientists are emerging computational systems that serve as collaborative partners in discovery. These systems remain difficult to build because they are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem. In omics, unified ecosystems have transformed research by enabling interoperability, reuse, and community-driven development; AI scientists require comparable infrastructure. We present ToolUniverse, an ecosystem for building AI scientists from any language or reasoning model, whether open or closed. ToolUniverse standardizes how AI scientists identify and call tools, integrating more than 600 machine learning models, datasets, APIs, and scientific packages for data analysis, knowledge retrieval, and experimental design. It automatically refines tool interfaces for correct use by AI scientists, creates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows. In a case study of hypercholesterolemia, ToolUniverse was used to create an AI scientist to identify a potent analog of a drug with favorable predicted properties. The open-source ToolUniverse is available at https://aiscientist.tools.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Jigsaw Post-Training Improves MLLMs</title>
      <itunes:episode>1196</itunes:episode>
      <podcast:episode>1196</podcast:episode>
      <itunes:title>Visual Jigsaw Post-Training Improves MLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c5523ed9-0c0c-40f6-9a90-407ddf59cd4e</guid>
      <link>https://share.transistor.fm/s/c9b394ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Visual Jigsaw Post-Training Improves MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25190v1">http://arxiv.org/abs/2509.25190v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities: images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Visual Jigsaw Post-Training Improves MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25190v1">http://arxiv.org/abs/2509.25190v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities: images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:07:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c9b394ee/21206b04.mp3" length="22390137" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Visual Jigsaw Post-Training Improves MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.25190v1">http://arxiv.org/abs/2509.25190v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities: images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance</title>
      <itunes:episode>1195</itunes:episode>
      <podcast:episode>1195</podcast:episode>
      <itunes:title>When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8880a78-8b59-4a6a-9b34-92b5f870a0b3</guid>
      <link>https://share.transistor.fm/s/73a695ed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22193v1">http://arxiv.org/abs/2509.22193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22193v1">http://arxiv.org/abs/2509.22193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 30 Sep 2025 21:06:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73a695ed/25ffc4d7.mp3" length="23850121" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1487</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Kevin El-Haddad, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22193v1">http://arxiv.org/abs/2509.22193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) with reasoning capabilities have achieved state-of-the-art performance on a wide range of tasks. Despite this empirical success, the tasks and model scales at which reasoning becomes effective, as well as its training and inference costs, remain underexplored. In this work, we rely on a synthetic data distillation framework to conduct a large-scale supervised study. We compare Instruction Fine-Tuning (IFT) and reasoning models of varying sizes on a wide range of math-centric and general-purpose tasks, evaluating both multiple-choice and open-ended formats. Our analysis reveals that reasoning consistently improves model performance, often matching or surpassing significantly larger IFT systems. Notably, while IFT remains Pareto-optimal in training and inference costs, reasoning models become increasingly valuable as model size scales, overcoming IFT performance limits on reasoning-intensive and open-ended tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongLive: Real-time Interactive Long Video Generation</title>
      <itunes:episode>1194</itunes:episode>
      <podcast:episode>1194</podcast:episode>
      <itunes:title>LongLive: Real-time Interactive Long Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bd942f5a-e73c-4ef6-bbfa-10a609d4ba54</guid>
      <link>https://share.transistor.fm/s/07b68ded</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 136 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            LongLive: Real-time Interactive Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22622v1">http://arxiv.org/abs/2509.22622v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink (abbreviated as frame sink), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 136 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            LongLive: Real-time Interactive Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22622v1">http://arxiv.org/abs/2509.22622v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink (abbreviated as frame sink), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:13:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07b68ded/475f46dc.mp3" length="23966690" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 136 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen</p>

            <p><strong>Title:</strong><br>
            LongLive: Real-time Interactive Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22622v1">http://arxiv.org/abs/2509.22622v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink (abbreviated as frame sink), preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Quantile Advantage Estimation for Entropy-Safe Reasoning</title>
      <itunes:episode>1193</itunes:episode>
      <podcast:episode>1193</podcast:episode>
      <itunes:title>Quantile Advantage Estimation for Entropy-Safe Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8862646c-4756-45a9-8671-c31288c641dd</guid>
      <link>https://share.transistor.fm/s/308f44be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He</p>

            <p><strong>Title:</strong><br>
            Quantile Advantage Estimation for Entropy-Safe Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22611v1">http://arxiv.org/abs/2509.22611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p &lt;= 1 - K) it reinforces rare successes, while on easy queries (p &gt; 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He</p>

            <p><strong>Title:</strong><br>
            Quantile Advantage Estimation for Entropy-Safe Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22611v1">http://arxiv.org/abs/2509.22611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p &lt;= 1 - K) it reinforces rare successes, while on easy queries (p &gt; 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:13:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/308f44be/fb97333f.mp3" length="22390151" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 102 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He</p>

            <p><strong>Title:</strong><br>
            Quantile Advantage Estimation for Entropy-Safe Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22611v1">http://arxiv.org/abs/2509.22611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p &lt;= 1 - K) it reinforces rare successes, while on easy queries (p &gt; 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning</title>
      <itunes:episode>1192</itunes:episode>
      <podcast:episode>1192</podcast:episode>
      <itunes:title>EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ce7f0b0f-6e42-4eb0-b2ce-83e468a3742c</guid>
      <link>https://share.transistor.fm/s/306efc61</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris</p>

            <p><strong>Title:</strong><br>
            EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22576v1">http://arxiv.org/abs/2509.22576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris</p>

            <p><strong>Title:</strong><br>
            EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22576v1">http://arxiv.org/abs/2509.22576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:12:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/306efc61/1f09af4e.mp3" length="26412198" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1647</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Wujiang, Wentian Zhao, Zhenting Wang, Li Yu-Jhe, Jin Can, Jin Mingyu, Mei Kai, Wan Kun, Metaxas Dimitris</p>

            <p><strong>Title:</strong><br>
            EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22576v1">http://arxiv.org/abs/2509.22576v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing</title>
      <itunes:episode>1191</itunes:episode>
      <podcast:episode>1191</podcast:episode>
      <itunes:title>MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9ad528de-099a-48d5-aea8-62ccffc7dece</guid>
      <link>https://share.transistor.fm/s/f99854fc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22186v1">http://arxiv.org/abs/2509.22186v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22186v1">http://arxiv.org/abs/2509.22186v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:12:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f99854fc/d138c528.mp3" length="24064530" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1500</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He</p>

            <p><strong>Title:</strong><br>
            MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22186v1">http://arxiv.org/abs/2509.22186v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReviewScore: Misinformed Peer Review Detection with Large Language Models</title>
      <itunes:episode>1190</itunes:episode>
      <podcast:episode>1190</podcast:episode>
      <itunes:title>ReviewScore: Misinformed Peer Review Detection with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9e9ce499-b556-41ae-a7dc-473177028698</guid>
      <link>https://share.transistor.fm/s/0739e52f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            ReviewScore: Misinformed Peer Review Detection with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21679v1">http://arxiv.org/abs/2509.21679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, indicating whether a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            ReviewScore: Misinformed Peer Review Detection with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21679v1">http://arxiv.org/abs/2509.21679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, indicating whether a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:12:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0739e52f/0f2db54d.mp3" length="21143395" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            ReviewScore: Misinformed Peer Review Detection with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21679v1">http://arxiv.org/abs/2509.21679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, indicating whether a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports the potential of fully automated ReviewScore evaluation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Variational Reasoning for Language Models</title>
      <itunes:episode>1189</itunes:episode>
      <podcast:episode>1189</podcast:episode>
      <itunes:title>Variational Reasoning for Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bae48955-2b74-4331-9511-e10f35cf7a43</guid>
      <link>https://share.transistor.fm/s/e9154bbc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Variational Reasoning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22637v1">http://arxiv.org/abs/2509.22637v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Variational Reasoning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22637v1">http://arxiv.org/abs/2509.22637v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:11:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9154bbc/d7e45507.mp3" length="21710116" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Variational Reasoning for Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22637v1">http://arxiv.org/abs/2509.22637v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Language Models Can Learn from Verbal Feedback Without Scalar Rewards</title>
      <itunes:episode>1188</itunes:episode>
      <podcast:episode>1188</podcast:episode>
      <itunes:title>Language Models Can Learn from Verbal Feedback Without Scalar Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc5fa614-1d15-4fbb-9a80-4662aed77172</guid>
      <link>https://share.transistor.fm/s/2dfd000d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Language Models Can Learn from Verbal Feedback Without Scalar Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22638v1">http://arxiv.org/abs/2509.22638v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Language Models Can Learn from Verbal Feedback Without Scalar Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22638v1">http://arxiv.org/abs/2509.22638v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:11:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2dfd000d/291dabd6.mp3" length="22625893" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1410</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Language Models Can Learn from Verbal Feedback Without Scalar Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22638v1">http://arxiv.org/abs/2509.22638v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning</title>
      <itunes:episode>1187</itunes:episode>
      <podcast:episode>1187</podcast:episode>
      <itunes:title>MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3942f434-750b-4874-b6d2-9c3b4ec5cdd5</guid>
      <link>https://share.transistor.fm/s/61084ba6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22281v1">http://arxiv.org/abs/2509.22281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/</p>
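
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> the Spatial Reasoning Chain implies an intermediate scene-graph representation between the task instruction and the final 3D layout. The sketch below shows one plausible shape for that structure; the field names and relation vocabulary are assumptions.</p>

            <pre><code>
# Minimal sketch of the intermediate representation implied by the Spatial
# Reasoning Chain: inferred objects, pairwise spatial relations, and a
# scene graph from which a 3D layout can be solved. Field names and the
# relation vocabulary are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    size_cm: tuple          # (width, depth, height)

@dataclass
class SpatialRelation:
    subject: str
    relation: str           # e.g. "left_of", "on_top_of", "in_front_of"
    reference: str

@dataclass
class TabletopSceneGraph:
    task: str
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)

graph = TabletopSceneGraph(task="prepare a cup of tea")
graph.objects.append(SceneObject("mug", (8, 8, 10)))
graph.objects.append(SceneObject("kettle", (20, 15, 25)))
graph.relations.append(SpatialRelation("mug", "in_front_of", "kettle"))
print(graph)
</code></pre>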
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22281v1">http://arxiv.org/abs/2509.22281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:10:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/61084ba6/fb44007f.mp3" length="24583624" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1533</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22281v1">http://arxiv.org/abs/2509.22281v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning</title>
      <itunes:episode>1186</itunes:episode>
      <podcast:episode>1186</podcast:episode>
      <itunes:title>CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc6ec71c-4f1c-4692-98c8-10e50f8b1c0b</guid>
      <link>https://share.transistor.fm/s/ef6ce911</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22647v1">http://arxiv.org/abs/2509.22647v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL brings significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.</p>
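
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> the verifiable reward described above is the accuracy of a vision-free LLM answering multiple-choice questions from the caption alone. A minimal sketch of that reward computation follows; the judge function and question format are placeholders.</p>

            <pre><code>
# Minimal sketch of a CapRL-style verifiable reward: the caption's reward is
# the accuracy of a vision-free LLM answering MCQs from the caption alone.
# `ask_text_llm` is a placeholder for any text-only judge model.

def ask_text_llm(question_text):
    # Placeholder judge: in practice this would query a text-only LLM
    # and parse its chosen option letter (e.g. "A", "B", "C", "D").
    return "A"

def caption_reward(caption, mcq_items):
    correct = 0
    for item in mcq_items:
        options = "\n".join(f"{letter}. {text}" for letter, text in item["options"])
        question_text = (
            f"Caption: {caption}\n"
            f"Question: {item['question']}\n{options}\n"
            "Answer with a single letter."
        )
        if ask_text_llm(question_text) == item["answer"]:
            correct += 1
    return correct / max(1, len(mcq_items))

mcqs = [{"question": "What animal is shown?",
         "options": [("A", "a dog"), ("B", "a cat")],
         "answer": "A"}]
print(caption_reward("A brown dog runs across a lawn.", mcqs))
</code></pre>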
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22647v1">http://arxiv.org/abs/2509.22647v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL brings significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:10:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ef6ce911/c03d442f.mp3" length="23008334" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1434</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.22647v1">http://arxiv.org/abs/2509.22647v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL brings significant improvements across multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping</title>
      <itunes:episode>1185</itunes:episode>
      <podcast:episode>1185</podcast:episode>
      <itunes:title>No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d2ac18d7-a47e-45b8-868e-006302a09f34</guid>
      <link>https://share.transistor.fm/s/e07b53b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21880v1">http://arxiv.org/abs/2509.21880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.</p>
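
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> when every rollout in a group receives the same reward, the group-normalized advantage is zero; the sketch below shows one way such prompts could still yield a shaped, entropy-modulated signal. The specific modulation used here is an assumption, not the paper's formula.</p>

            <pre><code>
# Minimal sketch of entropy-guided advantage shaping for zero-variance
# prompts: when every rollout in a group gets the same reward, the
# group-normalized advantage vanishes, so a shaped signal is used instead.
# Scaling by token entropy is an illustrative assumption.
import math

def grpo_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        return None  # zero-variance prompt: no contrastive signal
    return [(r - mean) / std for r in rewards]

def zvp_token_advantages(rewards, token_entropies, scale=0.1):
    # All rewards identical: reward correctness (+) or penalize errors (-),
    # modulated per token by its entropy so the signal stays nuanced.
    sign = 1.0 if rewards[0] == 1.0 else -1.0
    return [[sign * scale * h for h in traj] for traj in token_entropies]

group_rewards = [1.0, 1.0, 1.0, 1.0]            # every rollout correct
entropies = [[0.2, 0.9, 0.1], [0.4, 0.3, 0.6],
             [0.5, 0.5, 0.5], [0.8, 0.1, 0.2]]  # per-token entropies
if grpo_advantages(group_rewards) is None:
    print(zvp_token_advantages(group_rewards, entropies))
</code></pre>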
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21880v1">http://arxiv.org/abs/2509.21880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 29 Sep 2025 21:10:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e07b53b3/13e354f5.mp3" length="26829361" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1673</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Thanh-Long V. Le, Myeongho Jeon, Kim Vu, Viet Lai, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21880v1">http://arxiv.org/abs/2509.21880v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model responses to the same input differ in correctness, while ignoring those where all responses receive the same reward - so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models</title>
      <itunes:episode>1184</itunes:episode>
      <podcast:episode>1184</podcast:episode>
      <itunes:title>VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7854130f-c722-42c3-9329-ee800a45d919</guid>
      <link>https://share.transistor.fm/s/0f721e20</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang</p>

            <p><strong>Title:</strong><br>
            VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19803v1">http://arxiv.org/abs/2509.19803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working through mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.</p>
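
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> the core signal is the variance of each rollout group's rewards, which peaks at moderate difficulty. A minimal sketch of variance-scored curriculum selection follows; the top-k selection rule is illustrative rather than the paper's exact procedure.</p>

            <pre><code>
# Minimal sketch of variance-based curriculum selection: samples whose
# rollout groups have higher reward variance (moderate difficulty) are
# preferred for the next training batch. The top-k rule is illustrative.

def group_reward_variance(rewards):
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

def select_batch(samples, batch_size):
    # samples: list of dicts with a "rollout_rewards" list of 0/1 rewards.
    scored = [(group_reward_variance(s["rollout_rewards"]), s) for s in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:batch_size]]

pool = [
    {"id": "easy",   "rollout_rewards": [1, 1, 1, 1]},  # variance 0.0
    {"id": "medium", "rollout_rewards": [1, 0, 1, 0]},  # variance 0.25
    {"id": "hard",   "rollout_rewards": [0, 0, 0, 0]},  # variance 0.0
]
print([s["id"] for s in select_batch(pool, batch_size=1)])  # ['medium']
</code></pre>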
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang</p>

            <p><strong>Title:</strong><br>
            VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19803v1">http://arxiv.org/abs/2509.19803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working through mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:42:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0f721e20/7c9c2d08.mp3" length="21454782" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, Hao Wang</p>

            <p><strong>Title:</strong><br>
            VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19803v1">http://arxiv.org/abs/2509.19803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Policy-based reinforcement learning currently plays an important role in improving LLMs on mathematical reasoning tasks. However, existing rollout-based reinforcement learning methods (GRPO, DAPO, GSPO, etc.) fail to explicitly consider LLMs' learning ability for samples of different difficulty levels, which runs contrary to the human cognitive process of working through mathematical reasoning tasks from easy to difficult. Intuitively, we find that the variance of the rollout group's reward in RLVR partly reflects the difficulty of the current sample for LLMs. Samples that are too easy or too difficult have a lower variance, while samples with moderate difficulty have a higher variance. Based on this, we propose VCRL, a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards. Experiments on five mathematical benchmarks and two models reveal the advantages of VCRL over the current LLM RL baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines</title>
      <itunes:episode>1183</itunes:episode>
      <podcast:episode>1183</podcast:episode>
      <itunes:title>SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edbeaa5b-08dc-41cd-8249-ec01068cb78e</guid>
      <link>https://share.transistor.fm/s/f15e04b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21320v1">http://arxiv.org/abs/2509.21320v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21320v1">http://arxiv.org/abs/2509.21320v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:41:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f15e04b2/23dc30b2.mp3" length="22694857" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai</p>

            <p><strong>Title:</strong><br>
            SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21320v1">http://arxiv.org/abs/2509.21320v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruction-tuning datasets, and evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources</title>
      <itunes:episode>1182</itunes:episode>
      <podcast:episode>1182</podcast:episode>
      <itunes:title>MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ae5fecba-177f-43c1-afba-d3d7b48491a4</guid>
      <link>https://share.transistor.fm/s/03b120be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21268v1">http://arxiv.org/abs/2509.21268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.</p>
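
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> VAS ranks prompts by a Variance Promotion Score that combines outcome variance and trajectory diversity. The sketch below uses an assumed 0.5/0.5 weighting and a distinct-rollout diversity proxy, which are illustrative, not the paper's definition.</p>

            <pre><code>
# Minimal sketch of a Variance Promotion Score (VPS) combining outcome
# variance with trajectory diversity, used to rank prompts for sampling.
# The 0.5/0.5 weighting and the distinct-rollout diversity proxy are
# illustrative assumptions, not the paper's exact definition.

def outcome_variance(rewards):
    mean = sum(rewards) / len(rewards)
    return sum((r - mean) ** 2 for r in rewards) / len(rewards)

def trajectory_diversity(rollout_texts):
    # Fraction of distinct rollouts as a crude diversity proxy.
    return len(set(rollout_texts)) / len(rollout_texts)

def variance_promotion_score(rewards, rollout_texts, w_var=0.5, w_div=0.5):
    return w_var * outcome_variance(rewards) + w_div * trajectory_diversity(rollout_texts)

rewards = [1, 0, 1, 1]
rollouts = ["proof A", "proof B", "proof A", "proof C"]
print(round(variance_promotion_score(rewards, rollouts), 4))
</code></pre>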
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21268v1">http://arxiv.org/abs/2509.21268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:41:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/03b120be/a847fe18.mp3" length="27689065" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1727</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21268v1">http://arxiv.org/abs/2509.21268v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Tree Search for LLM Agent Reinforcement Learning</title>
      <itunes:episode>1181</itunes:episode>
      <podcast:episode>1181</podcast:episode>
      <itunes:title>Tree Search for LLM Agent Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bca276dc-d9cd-4c9d-acaa-c6f9c5f4a0f1</guid>
      <link>https://share.transistor.fm/s/c2e770c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu</p>

            <p><strong>Title:</strong><br>
            Tree Search for LLM Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21240v1">http://arxiv.org/abs/2509.21240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that tree-structured trajectories naturally allow the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.</p>
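
            <p><strong>Illustrative sketch (editorial, not from the paper):</strong> the abstract describes group-relative advantages computed both within a tree (rollouts sharing a prefix) and across trees. The sketch below centers leaf rewards against the tree mean and the global mean and simply sums the two signals; that combination is an assumption.</p>

            <pre><code>
# Minimal sketch of intra-tree and inter-tree group-relative advantages:
# rollouts are grouped by the tree (shared prefix) they came from, and a
# leaf's advantage is computed against its own tree's mean and against the
# mean over all trees. The simple sum of the two signals is illustrative.

def mean(xs):
    return sum(xs) / len(xs)

def tree_grpo_advantages(trees):
    # trees: list of lists of leaf rewards, one inner list per search tree.
    all_rewards = [r for tree in trees for r in tree]
    global_mean = mean(all_rewards)
    advantages = []
    for tree in trees:
        tree_mean = mean(tree)
        advantages.append(
            [(r - tree_mean) + (r - global_mean) for r in tree]
        )
    return advantages

rollout_trees = [[1.0, 0.0, 1.0],   # tree 1: rollouts sharing one prefix
                 [0.0, 0.0, 1.0]]   # tree 2
print(tree_grpo_advantages(rollout_trees))
</code></pre>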
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu</p>

            <p><strong>Title:</strong><br>
            Tree Search for LLM Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21240v1">http://arxiv.org/abs/2509.21240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that tree-structured trajectories naturally allow the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:41:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c2e770c1/cd2ab17f.mp3" length="23903573" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1490</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu</p>

            <p><strong>Title:</strong><br>
            Tree Search for LLM Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21240v1">http://arxiv.org/abs/2509.21240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that tree-structured trajectories naturally allow the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seedream 4.0: Toward Next-generation Multimodal Image Generation</title>
      <itunes:episode>1180</itunes:episode>
      <podcast:episode>1180</podcast:episode>
      <itunes:title>Seedream 4.0: Toward Next-generation Multimodal Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">01cb836a-fcf8-4e14-880d-4e71c77de6a3</guid>
      <link>https://share.transistor.fm/s/abdcff80</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu</p>

            <p><strong>Title:</strong><br>
            Seedream 4.0: Toward Next-generation Multimodal Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20427v1">http://arxiv.org/abs/2509.20427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which can also reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as the PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu</p>

            <p><strong>Title:</strong><br>
            Seedream 4.0: Toward Next-generation Multimodal Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20427v1">http://arxiv.org/abs/2509.20427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to rapidly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training to jointly train both T2I and image editing tasks. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as a PE model). Comprehensive evaluations reveal that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also allows for multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:40:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/abdcff80/84c5a9c4.mp3" length="20701185" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu</p>

            <p><strong>Title:</strong><br>
            Seedream 4.0: Toward Next-generation Multimodal Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20427v1">http://arxiv.org/abs/2509.20427v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to rapidly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training to jointly train both T2I and image editing tasks. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as a PE model). Comprehensive evaluations reveal that Seedream 4.0 achieves state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning; it also allows for multi-image reference and can generate multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible at https://www.volcengine.com/experience/ark?launch=seedream.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets</title>
      <itunes:episode>1179</itunes:episode>
      <podcast:episode>1179</podcast:episode>
      <itunes:title>Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe933c77-024b-474f-8579-371c09123980</guid>
      <link>https://share.transistor.fm/s/b8227f2b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21245v1">http://arxiv.org/abs/2509.21245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.</p>
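
            <p>As a rough sketch of the difficulty-aware sampling described above: one control modality is drawn per training example, with draws biased toward harder signals such as skeletal pose and away from easier ones such as point clouds. The modality names and weights below are illustrative assumptions, not values from the paper.</p>

            <pre><code>import random

# Illustrative (assumed) sampling weights: harder control signals such as
# skeletal pose are drawn more often than easier ones such as point clouds.
MODALITY_WEIGHTS = {
    "skeletal_pose": 0.4,   # hardest -> sampled most often (assumed value)
    "bounding_box": 0.25,
    "voxels": 0.2,
    "point_cloud": 0.15,    # easiest -> downweighted (assumed value)
}

def sample_control_modality(example, rng=random):
    """Pick exactly one conditioning modality for this training example."""
    available = [m for m in MODALITY_WEIGHTS if m in example]
    if not available:                       # graceful handling of missing inputs
        return None                         # fall back to image-only conditioning
    weights = [MODALITY_WEIGHTS[m] for m in available]
    return rng.choices(available, weights=weights, k=1)[0]

# Example: an asset annotated with a point cloud and a skeletal pose prior.
example = {"point_cloud": "...", "skeletal_pose": "..."}
print(sample_control_modality(example))</code></pre>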
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21245v1">http://arxiv.org/abs/2509.21245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:40:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8227f2b/3def025a.mp3" length="24206621" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1509</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Hunyuan3D, Bowen Zhang, Chunchao Guo, Haolin Liu, Hongyu Yan, Huiwen Shi, Jingwei Huang, Junlin Yu, Kunhong Li, Linus, Penghao Wang, Qingxiang Lin, Sicong Liu, Xianghui Yang, Yixuan Tang, Yunfei Zhao, Zeqiang Lai, Zhihao Liang, Zibo Zhao</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21245v1">http://arxiv.org/abs/2509.21245v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D-native generative models have accelerated asset creation for games, film, and design. However, most methods still rely primarily on image or text conditioning and lack fine-grained, cross-modal controls, which limits controllability and practical adoption. To address this gap, we present Hunyuan3D-Omni, a unified framework for fine-grained, controllable 3D asset generation built on Hunyuan3D 2.1. In addition to images, Hunyuan3D-Omni accepts point clouds, voxels, bounding boxes, and skeletal pose priors as conditioning signals, enabling precise control over geometry, topology, and pose. Instead of separate heads for each modality, our model unifies all signals in a single cross-modal architecture. We train with a progressive, difficulty-aware sampling strategy that selects one control modality per example and biases sampling toward harder signals (e.g., skeletal pose) while downweighting easier ones (e.g., point clouds), encouraging robust multi-modal fusion and graceful handling of missing inputs. Experiments show that these additional controls improve generation accuracy, enable geometry-aware transformations, and increase robustness for production workflows.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AutoIntent: AutoML for Text Classification</title>
      <itunes:episode>1178</itunes:episode>
      <podcast:episode>1178</podcast:episode>
      <itunes:title>AutoIntent: AutoML for Text Classification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">36d67e63-6081-43f1-a701-53aa54018fde</guid>
      <link>https://share.transistor.fm/s/e2e61e23</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ilya Alekseev, Roman Solomatin, Darina Rustamova, Denis Kuznetsov</p>

            <p><strong>Title:</strong><br>
            AutoIntent: AutoML for Text Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21138v1">http://arxiv.org/abs/2509.21138v1</a></p>

            <p><strong>Abstract:</strong><br>
            AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.</p>
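
            <p>To make the "sklearn-like" workflow concrete, the sketch below strings together an embedding step, a classifier, and decision-threshold tuning for out-of-scope detection using plain scikit-learn. The class names and the threshold value are generic stand-ins, not AutoIntent's actual API.</p>

            <pre><code># Generic stand-in for the embedding -> classifier -> threshold workflow the
# abstract describes; this is NOT AutoIntent's API, just plain scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["book a flight", "cancel my booking", "play some jazz", "turn up the volume"]
labels = [0, 0, 1, 1]  # toy intents: 0 = travel, 1 = media

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(texts, labels)

# Decision-threshold tuning for out-of-scope detection: if the best class
# probability falls below the threshold, treat the query as out-of-scope.
threshold = 0.6  # illustrative value; an AutoML loop would tune this
probs = pipe.predict_proba(["what is the meaning of life"])[0]
pred = int(np.argmax(probs)) if probs.max() >= threshold else "out_of_scope"
print(pred, probs)</code></pre>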
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ilya Alekseev, Roman Solomatin, Darina Rustamova, Denis Kuznetsov</p>

            <p><strong>Title:</strong><br>
            AutoIntent: AutoML for Text Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21138v1">http://arxiv.org/abs/2509.21138v1</a></p>

            <p><strong>Abstract:</strong><br>
            AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 26 Sep 2025 20:39:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e2e61e23/a65b2628.mp3" length="21634048" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ilya Alekseev, Roman Solomatin, Darina Rustamova, Denis Kuznetsov</p>

            <p><strong>Title:</strong><br>
            AutoIntent: AutoML for Text Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.21138v1">http://arxiv.org/abs/2509.21138v1</a></p>

            <p><strong>Abstract:</strong><br>
            AutoIntent is an automated machine learning tool for text classification tasks. Unlike existing solutions, AutoIntent offers end-to-end automation with embedding model selection, classifier optimization, and decision threshold tuning, all within a modular, sklearn-like interface. The framework is designed to support multi-label classification and out-of-scope detection. AutoIntent demonstrates superior performance compared to existing AutoML tools on standard intent classification datasets and enables users to balance effectiveness and resource consumption.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video models are zero-shot learners and reasoners</title>
      <itunes:episode>1177</itunes:episode>
      <podcast:episode>1177</podcast:episode>
      <itunes:title>Video models are zero-shot learners and reasoners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">371d8af7-2099-418f-bbe1-8c85c6eee98f</guid>
      <link>https://share.transistor.fm/s/0a247a4a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos</p>

            <p><strong>Title:</strong><br>
            Video models are zero-shot learners and reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20328v1">http://arxiv.org/abs/2509.20328v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos</p>

            <p><strong>Title:</strong><br>
            Video models are zero-shot learners and reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20328v1">http://arxiv.org/abs/2509.20328v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Sep 2025 19:56:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0a247a4a/665b8d03.mp3" length="23977970" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1495</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos</p>

            <p><strong>Title:</strong><br>
            Video models are zero-shot learners and reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20328v1">http://arxiv.org/abs/2509.20328v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SIM-CoT: Supervised Implicit Chain-of-Thought</title>
      <itunes:episode>1176</itunes:episode>
      <podcast:episode>1176</podcast:episode>
      <itunes:title>SIM-CoT: Supervised Implicit Chain-of-Thought</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7bda3e64-ac4d-481b-b9eb-c4e070d99c37</guid>
      <link>https://share.transistor.fm/s/011a53da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            SIM-CoT: Supervised Implicit Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20317v2">http://arxiv.org/abs/2509.20317v2</a></p>

            <p><strong>Abstract:</strong><br>
            Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT</p>
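
            <p>A minimal sketch of the step-level supervision idea: during training, each implicit (latent) reasoning token is decoded by an auxiliary head and aligned with a token from the corresponding explicit reasoning step, and that auxiliary decoder is dropped at inference. The dimensions, the one-token-per-step simplification, and the module names are assumptions for illustration, not the paper's exact implementation.</p>

            <pre><code>import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LATENT = 32000, 768, 4   # assumed sizes for illustration

class StepSupervisedLatentCoT(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable implicit reasoning tokens (stand-in for the latent CoT states).
        self.latent_tokens = nn.Parameter(torch.randn(N_LATENT, D_MODEL) * 0.02)
        # Auxiliary decoder used ONLY during training: maps each latent
        # reasoning state to a distribution over explicit-step tokens.
        self.aux_decoder = nn.Linear(D_MODEL, VOCAB)

    def step_supervision_loss(self, latent_states, explicit_step_ids):
        # latent_states: (batch, N_LATENT, D_MODEL) implicit reasoning states
        # explicit_step_ids: (batch, N_LATENT) one token id per explicit step
        #                    (a simplification; real steps span many tokens)
        logits = self.aux_decoder(latent_states)             # (B, N_LATENT, VOCAB)
        return nn.functional.cross_entropy(
            logits.view(-1, VOCAB), explicit_step_ids.view(-1)
        )

model = StepSupervisedLatentCoT()
states = torch.randn(2, N_LATENT, D_MODEL)
targets = torch.randint(0, VOCAB, (2, N_LATENT))
# Added to the usual LM loss during training; aux_decoder is discarded at inference.
print(model.step_supervision_loss(states, targets))</code></pre>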
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            SIM-CoT: Supervised Implicit Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20317v2">http://arxiv.org/abs/2509.20317v2</a></p>

            <p><strong>Abstract:</strong><br>
            Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 25 Sep 2025 19:55:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/011a53da/8f0e80e2.mp3" length="23201816" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            SIM-CoT: Supervised Implicit Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.20317v2">http://arxiv.org/abs/2509.20317v2</a></p>

            <p><strong>Abstract:</strong><br>
            Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2% on GPT-2 and CODI by +3.0% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1% with 2.3× greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR</title>
      <itunes:episode>1175</itunes:episode>
      <podcast:episode>1175</podcast:episode>
      <itunes:title>Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">31b3f6b8-5a86-4072-b457-bf605c09ad8d</guid>
      <link>https://share.transistor.fm/s/3096afa8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18174v1">http://arxiv.org/abs/2509.18174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18174v1">http://arxiv.org/abs/2509.18174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Sep 2025 20:16:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3096afa8/b144e36b.mp3" length="19602792" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1221</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18174v1">http://arxiv.org/abs/2509.18174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reinforcement Learning on Pre-Training Data</title>
      <itunes:episode>1174</itunes:episode>
      <podcast:episode>1174</podcast:episode>
      <itunes:title>Reinforcement Learning on Pre-Training Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c9252de-2bbc-4370-a5b6-c341c4012496</guid>
      <link>https://share.transistor.fm/s/7c169bd0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning on Pre-Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19249v1">http://arxiv.org/abs/2509.19249v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.</p>
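
            <p>The next-segment reasoning objective can be caricatured as follows: the policy samples a continuation for a pre-training context, and the reward measures how well that continuation matches the segment that actually follows in the corpus, with no human labels involved. The token-level F1 used below is an illustrative stand-in, not necessarily the reward the paper uses.</p>

            <pre><code>from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Toy overlap-based score between a predicted and the true next segment."""
    p, g = pred.split(), gold.split()
    common = sum((Counter(p) & Counter(g)).values())
    if not common:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def rlpt_reward(policy_sample: str, corpus_next_segment: str) -> float:
    # Reward is derived directly from pre-training data: agreement with the
    # segment that actually follows the context in the corpus.
    return token_f1(policy_sample, corpus_next_segment)

context = "The mitochondrion is often described as"
gold_next = "the powerhouse of the cell because it produces ATP"
sample = "the powerhouse of the cell, generating ATP for the cell"
print(rlpt_reward(sample, gold_next))</code></pre>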
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning on Pre-Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19249v1">http://arxiv.org/abs/2509.19249v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Sep 2025 20:16:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7c169bd0/9fe7895d.mp3" length="20143607" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning on Pre-Training Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.19249v1">http://arxiv.org/abs/2509.19249v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Do You Need Proprioceptive States in Visuomotor Policies?</title>
      <itunes:episode>1173</itunes:episode>
      <podcast:episode>1173</podcast:episode>
      <itunes:title>Do You Need Proprioceptive States in Visuomotor Policies?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c08103cd-0510-41b7-bc3e-8d3922c47953</guid>
      <link>https://share.transistor.fm/s/6178d4c8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, Junliang Guo, Dequan Wang, Yang Gao</p>

            <p><strong>Title:</strong><br>
            Do You Need Proprioceptive States in Visuomotor Policies?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18644v2">http://arxiv.org/abs/2509.18644v2</a></p>

            <p><strong>Abstract:</strong><br>
            Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. Instead, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, it also shows advantages in data efficiency and cross-embodiment adaptation, enhancing its practicality for real-world deployment. Discover more by visiting: https://statefreepolicy.github.io.</p>
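
            <p>A compact sketch of the contrast drawn above: a conventional visuomotor policy concatenates image features with the proprioceptive state, while the state-free variant conditions on the two wrist-camera views alone and predicts relative end-effector actions. The feature sizes and action dimension are illustrative assumptions, not the paper's configuration.</p>

            <pre><code>import torch
import torch.nn as nn

IMG_FEAT, STATE_DIM, ACT_DIM = 256, 14, 7   # assumed sizes for illustration

class StateBasedPolicy(nn.Module):
    """Conventional variant: wrist-camera features + proprioceptive state."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * IMG_FEAT + STATE_DIM, ACT_DIM)

    def forward(self, wrist_feats, proprio_state):
        return self.head(torch.cat([*wrist_feats, proprio_state], dim=-1))

class StateFreePolicy(nn.Module):
    """State-free variant: only the two wide-angle wrist views; outputs
    relative end-effector deltas, so no absolute proprioceptive state is needed."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * IMG_FEAT, ACT_DIM)

    def forward(self, wrist_feats):
        return self.head(torch.cat(list(wrist_feats), dim=-1))

feats = (torch.randn(1, IMG_FEAT), torch.randn(1, IMG_FEAT))
print(StateFreePolicy()(feats).shape)   # relative end-effector action, (1, 7)</code></pre>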
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, Junliang Guo, Dequan Wang, Yang Gao</p>

            <p><strong>Title:</strong><br>
            Do You Need Proprioceptive States in Visuomotor Policies?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18644v2">http://arxiv.org/abs/2509.18644v2</a></p>

            <p><strong>Abstract:</strong><br>
            Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. Instead, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, it also shows advantages in data efficiency and cross-embodiment adaptation, enhancing its practicality for real-world deployment. Discover more by visiting: https://statefreepolicy.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Sep 2025 20:15:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6178d4c8/96767347.mp3" length="25041267" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1561</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, Junliang Guo, Dequan Wang, Yang Gao</p>

            <p><strong>Title:</strong><br>
            Do You Need Proprioceptive States in Visuomotor Policies?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18644v2">http://arxiv.org/abs/2509.18644v2</a></p>

            <p><strong>Abstract:</strong><br>
            Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. Instead, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, it also shows advantages in data efficiency and cross-embodiment adaptation, enhancing its practicality for real-world deployment. Discover more by visiting: https://statefreepolicy.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe</title>
      <itunes:episode>1172</itunes:episode>
      <podcast:episode>1172</podcast:episode>
      <itunes:title>MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0b2130eb-00f4-4a05-a0fe-577ed5a76b64</guid>
      <link>https://share.transistor.fm/s/59e7c0ce</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18154v1">http://arxiv.org/abs/2509.18154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18154v1">http://arxiv.org/abs/2509.18154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in the OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, this strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 24 Sep 2025 20:15:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/59e7c0ce/30a5cb02.mp3" length="24121363" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18154v1">http://arxiv.org/abs/2509.18154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LIMI: Less is More for Agency</title>
      <itunes:episode>1171</itunes:episode>
      <podcast:episode>1171</podcast:episode>
      <itunes:title>LIMI: Less is More for Agency</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2eda8625-58f1-44ad-8557-86608df9efe2</guid>
      <link>https://share.transistor.fm/s/7ea53ffb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMI: Less is More for Agency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17567v1">http://arxiv.org/abs/2509.17567v1</a></p>

            <p><strong>Abstract:</strong><br>
            We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don't just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples, achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.</p>
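
            <p>A quick arithmetic check of the sample-efficiency figures quoted above (the numbers are taken directly from the abstract):</p>

            <pre><code># Sample-efficiency ratio and score gaps reported in the abstract.
limi_samples, baseline_samples = 78, 10_000
print(baseline_samples / limi_samples)  # ~128.2, i.e. "128 times fewer samples"

limi_score = 73.5
baselines = {"Kimi-K2-Instruct": 24.1, "DeepSeek-V3.1": 11.9,
             "Qwen3-235B-A22B-Instruct": 27.5, "GLM-4.5": 45.1}
for name, score in baselines.items():
    print(name, "trails LIMI by", round(limi_score - score, 1), "points")</code></pre>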
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMI: Less is More for Agency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17567v1">http://arxiv.org/abs/2509.17567v1</a></p>

            <p><strong>Abstract:</strong><br>
            We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don't just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples, achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Sep 2025 20:22:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ea53ffb/2e886da8.mp3" length="20645144" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yang Xiao, Mohan Jiang, Jie Sun, Keyu Li, Jifan Lin, Yumin Zhuang, Ji Zeng, Shijie Xia, Qishuo Hua, Xuefeng Li, Xiaojie Cai, Tongyu Wang, Yue Zhang, Liming Liu, Xia Wu, Jinlong Hou, Yuan Cheng, Wenjie Li, Xiang Wang, Dequan Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMI: Less is More for Agency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17567v1">http://arxiv.org/abs/2509.17567v1</a></p>

            <p><strong>Abstract:</strong><br>
            We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don't just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples, achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3-Omni Technical Report</title>
      <itunes:episode>1170</itunes:episode>
      <podcast:episode>1170</podcast:episode>
      <itunes:title>Qwen3-Omni Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2bfc4519-ae78-4252-aedf-b12613b21d03</guid>
      <link>https://share.transistor.fm/s/f659770e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI, cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17765v1">http://arxiv.org/abs/2509.17765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.</p>
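
            <p>A hedged sketch of streaming synthesis from the first codec frame: the talker emits one multi-codebook frame at a time and a lightweight causal decoder turns it into audio immediately, rather than waiting for a whole block. The codebook count, frame rate, and stand-in decoder are assumptions, not Qwen3-Omni's actual components:</p>

            <pre><code>import random

NUM_CODEBOOKS = 4   # assumption, not the model's real codebook count
FRAME_MS = 20       # assumption: one codec frame per 20 ms of audio

def predict_codes(history):
    """Stand-in for the autoregressive talker: one discrete code per codebook."""
    return [random.randrange(1024) for _ in range(NUM_CODEBOOKS)]

def decode_frame(codes):
    """Stand-in for a lightweight causal decoder replacing block-wise diffusion."""
    return bytes(640)  # fake 20 ms of 16 kHz / 16-bit audio

def stream(num_frames=5):
    history, elapsed = [], 0
    for _ in range(num_frames):
        codes = predict_codes(history)
        history.append(codes)
        audio = decode_frame(codes)  # emitted right away, from the first frame
        elapsed += FRAME_MS
        yield elapsed, len(audio)

for t, nbytes in stream():
    print(t, "ms: emitted", nbytes, "audio bytes")</code></pre>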
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI, cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17765v1">http://arxiv.org/abs/2509.17765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Sep 2025 20:22:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f659770e/2fb20661.mp3" length="14744394" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>918</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI, cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen3-Omni Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17765v1">http://arxiv.org/abs/2509.17765v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models</title>
      <itunes:episode>1169</itunes:episode>
      <podcast:episode>1169</podcast:episode>
      <itunes:title>OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bffddd3b-e566-4a88-af7e-1d12e2cc6c75</guid>
      <link>https://share.transistor.fm/s/2ca8b52e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He</p>

            <p><strong>Title:</strong><br>
            OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17627v1">http://arxiv.org/abs/2509.17627v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He</p>

            <p><strong>Title:</strong><br>
            OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17627v1">http://arxiv.org/abs/2509.17627v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Sep 2025 20:21:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ca8b52e/f97d41fc.mp3" length="22050799" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He</p>

            <p><strong>Title:</strong><br>
            OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.17627v1">http://arxiv.org/abs/2509.17627v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System</title>
      <itunes:episode>1168</itunes:episode>
      <podcast:episode>1168</podcast:episode>
      <itunes:title>OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">33288b5a-c617-45a1-9add-7889ac452cd9</guid>
      <link>https://share.transistor.fm/s/40aba4bd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng</p>

            <p><strong>Title:</strong><br>
            OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18091v1">http://arxiv.org/abs/2509.18091v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue.</p>
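
            <p>A hedged sketch of the structured context engineering described above: interaction history, preference signals, and scenario signals packed into one structured token sequence. The field names and separator layout are assumptions for illustration:</p>

            <pre><code>def build_context(history, preferences, scenario):
    """Pack interaction history, preference signals, and scenario signals
    into a single structured token sequence (layout is an assumption)."""
    tokens = ["[HIST]"]
    tokens.extend(f"item_{i}" for i in history)
    tokens.append("[PREF]")
    tokens.extend(f"pref_{p}" for p in preferences)
    tokens.append("[SCENE]")
    tokens.extend(f"scene_{k}={v}" for k, v in scenario.items())
    return tokens

print(build_context(
    history=[101, 204, 309],
    preferences=["electronics", "low_price"],
    scenario={"query": "wireless earbuds", "market": "SG"},
))</code></pre>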
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng</p>

            <p><strong>Title:</strong><br>
            OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18091v1">http://arxiv.org/abs/2509.18091v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Sep 2025 20:21:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40aba4bd/19df05ea.mp3" length="22829041" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1423</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng</p>

            <p><strong>Title:</strong><br>
            OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18091v1">http://arxiv.org/abs/2509.18091v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs</title>
      <itunes:episode>1167</itunes:episode>
      <podcast:episode>1167</podcast:episode>
      <itunes:title>TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0d0d1c25-78fa-48a0-a828-0b5888878c47</guid>
      <link>https://share.transistor.fm/s/53674bd7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18056v1">http://arxiv.org/abs/2509.18056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1</p>
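
            <p>A hedged sketch of the core idea: treat the ground-truth annotation as one extra off-policy sample in the group, compute group-relative advantages, and reshape them with an asymmetric non-linear transform. The tanh-based transform and its scales are assumptions, not the paper's actual formulation:</p>

            <pre><code>import math, statistics

def soft_advantages(onpolicy_rewards, gt_reward, pos_scale=1.0, neg_scale=0.5):
    """Group-relative advantages over on-policy rollouts plus one ground-truth
    (off-policy) solution, squashed by an asymmetric non-linear transform."""
    rewards = list(onpolicy_rewards) + [gt_reward]  # GT joins the sampling group
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    out = []
    for r in rewards:
        a = (r - mean) / std
        # asymmetric reshaping: damp negative advantages more than positive ones
        out.append(math.tanh(pos_scale * a) if a >= 0 else math.tanh(neg_scale * a))
    return out

print(soft_advantages([0.1, 0.3, 0.0, 0.2], gt_reward=0.9))</code></pre>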
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18056v1">http://arxiv.org/abs/2509.18056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 23 Sep 2025 20:21:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/53674bd7/bd68a14d.mp3" length="25849211" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1612</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.18056v1">http://arxiv.org/abs/2509.18056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation</title>
      <itunes:episode>1166</itunes:episode>
      <podcast:episode>1166</podcast:episode>
      <itunes:title>RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">969164f3-db9c-4c6d-99c6-73533429a5be</guid>
      <link>https://share.transistor.fm/s/3621a702</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang</p>

            <p><strong>Title:</strong><br>
            RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16198v1">http://arxiv.org/abs/2509.16198v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9× the strongest baseline (Claude Code) and about 64× other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.</p>
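
            <p>A hedged sketch of what a repository planning graph could look like as a data structure: typed nodes for capabilities, files, and functions, with edges for implementation, containment, and data flow. The node and edge types are assumptions, not the paper's exact schema:</p>

            <pre><code>from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                  # "capability" | "file" | "function"
    edges: list = field(default_factory=list)  # (relation, target name) pairs

graph = {}

def add(name, kind):
    graph[name] = Node(name, kind)

def link(src, relation, dst):
    graph[src].edges.append((relation, dst))

add("tabular-io", "capability")
add("io/reader.py", "file")
add("load_csv", "function")
link("tabular-io", "implemented_by", "io/reader.py")  # proposal to implementation
link("io/reader.py", "defines", "load_csv")           # file structure
link("load_csv", "feeds", "tabular-io")               # data flow

for node in graph.values():
    print(node.name, node.kind, node.edges)</code></pre>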
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang</p>

            <p><strong>Title:</strong><br>
            RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16198v1">http://arxiv.org/abs/2509.16198v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9× the strongest baseline (Claude Code) and about 64× other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Sep 2025 20:04:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3621a702/2def81dd.mp3" length="27345914" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1705</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 89 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang</p>

            <p><strong>Title:</strong><br>
            RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16198v1">http://arxiv.org/abs/2509.16198v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9× the strongest baseline (Claude Code) and about 64× other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer</title>
      <itunes:episode>1165</itunes:episode>
      <podcast:episode>1165</podcast:episode>
      <itunes:title>MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">647a106d-b501-48b2-bf7a-20dbbc3c6607</guid>
      <link>https://share.transistor.fm/s/b2d51ab0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen</p>

            <p><strong>Title:</strong><br>
            MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16197v1">http://arxiv.org/abs/2509.16197v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.</p>
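
            <p>A hedged sketch of the hybrid tokenizer idea: one shared vision encoder whose features feed a continuous adapter for understanding and a quantizing adapter for generation. The feature sizes and nearest-neighbour codebook are assumptions for illustration:</p>

            <pre><code>import random

def shared_encoder(image_id):
    """Stand-in for the single shared vision encoder (4 patch features of dim 8)."""
    rng = random.Random(image_id)
    return [[rng.random() for _ in range(8)] for _ in range(4)]

def continuous_adapter(features):
    """Understanding path: pass features on as continuous embeddings."""
    return features

def discrete_adapter(features, codebook):
    """Generation path: quantize each feature to its nearest codebook entry."""
    def nearest(vec):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))
    return [nearest(f) for f in features]

codebook = [[random.random() for _ in range(8)] for _ in range(16)]
feats = shared_encoder("example.png")
print("understanding path:", len(continuous_adapter(feats)), "continuous embeddings")
print("generation path token ids:", discrete_adapter(feats, codebook))</code></pre>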
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen</p>

            <p><strong>Title:</strong><br>
            MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16197v1">http://arxiv.org/abs/2509.16197v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Sep 2025 20:04:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2d51ab0/8a9c61bf.mp3" length="24698151" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1540</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen</p>

            <p><strong>Title:</strong><br>
            MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.16197v1">http://arxiv.org/abs/2509.16197v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification</title>
      <itunes:episode>1164</itunes:episode>
      <podcast:episode>1164</podcast:episode>
      <itunes:title>Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">409268c6-c13b-4b03-9766-f6d2202c4aaa</guid>
      <link>https://share.transistor.fm/s/601f63c8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin</p>

            <p><strong>Title:</strong><br>
            Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15591v1">http://arxiv.org/abs/2509.15591v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.</p>
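            <p><strong>Illustrative code sketch (not from the paper):</strong><br>
            A toy sketch of the composition idea described above: each modality gets an encoder into a shared latent space and a decoder out of it, and a task is just one encoder chained with one decoder. The random linear maps and dimensions are assumptions standing in for learned networks.</p>
            <pre><code>import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8
W_img_enc = rng.normal(size=(4, LATENT_DIM))   # stand-in for a learned image encoder
W_img_dec = rng.normal(size=(LATENT_DIM, 4))   # stand-in for a learned image decoder

encoders = {
    "image": lambda x: np.tanh(np.asarray(x, float) @ W_img_enc),
    "label": lambda y: np.eye(LATENT_DIM)[int(y) % LATENT_DIM],
}
decoders = {
    "image": lambda z: z @ W_img_dec,
    "label": lambda z: int(np.argmax(z[:3])),
}

def task(encoder_name, decoder_name, x):
    """A task is encoder -> shared latent -> decoder."""
    return decoders[decoder_name](encoders[encoder_name](x))

print(task("image", "label", rng.normal(size=4)))  # classification: image encoder + label decoder
print(task("label", "image", 2))                   # conditional generation: label encoder + image decoder
</code></pre>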
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin</p>

            <p><strong>Title:</strong><br>
            Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15591v1">http://arxiv.org/abs/2509.15591v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 22 Sep 2025 20:04:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/601f63c8/069c891c.mp3" length="21854800" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin</p>

            <p><strong>Title:</strong><br>
            Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15591v1">http://arxiv.org/abs/2509.15591v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data</title>
      <itunes:episode>1163</itunes:episode>
      <podcast:episode>1163</podcast:episode>
      <itunes:title>ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3af0352f-935a-4291-9a24-5a653028ec76</guid>
      <link>https://share.transistor.fm/s/5625fe98</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15221v1">http://arxiv.org/abs/2509.15221v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15221v1">http://arxiv.org/abs/2509.15221v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:45:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5625fe98/1d1fd787.mp3" length="22509287" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15221v1">http://arxiv.org/abs/2509.15221v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
      <podcast:transcript url="https://share.transistor.fm/s/5625fe98/transcription.vtt" type="text/vtt" rel="captions"/>
      <podcast:transcript url="https://share.transistor.fm/s/5625fe98/transcription.srt" type="application/x-subrip" rel="captions"/>
      <podcast:transcript url="https://share.transistor.fm/s/5625fe98/transcription.json" type="application/json" rel="captions"/>
      <podcast:transcript url="https://share.transistor.fm/s/5625fe98/transcription.txt" type="text/plain"/>
      <podcast:transcript url="https://share.transistor.fm/s/5625fe98/transcription" type="text/html"/>
    </item>
    <item>
      <title>FlowRL: Matching Reward Distributions for LLM Reasoning</title>
      <itunes:episode>1162</itunes:episode>
      <podcast:episode>1162</podcast:episode>
      <itunes:title>FlowRL: Matching Reward Distributions for LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6c32f91c-fe86-4568-b707-e255c3c071f8</guid>
      <link>https://share.transistor.fm/s/1564aed7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin</p>

            <p><strong>Title:</strong><br>
            FlowRL: Matching Reward Distributions for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15207v1">http://arxiv.org/abs/2509.15207v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.</p>
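            <p><strong>Illustrative code sketch (not from the paper):</strong><br>
            A minimal sketch of the reward-distribution-matching idea described above: scalar rewards are turned into a normalized target distribution and the reverse KL divergence from the policy to that target is computed. The softmax normalization is a fixed stand-in for the learnable partition function; all names are assumptions.</p>
            <pre><code>import numpy as np

def softmax(x):
    x = np.asarray(x, float) - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def reverse_kl(policy_probs, rewards):
    """KL(policy || target), where target_i is proportional to exp(reward_i)."""
    target = softmax(rewards)            # stand-in for the learnable partition function
    p = np.asarray(policy_probs, float)
    return float(np.sum(p * np.log(p / target)))

# Toy usage: four sampled reasoning paths with different scalar rewards.
print(reverse_kl([0.4, 0.3, 0.2, 0.1], rewards=[1.0, 0.5, 0.9, 0.2]))
</code></pre>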
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin</p>

            <p><strong>Title:</strong><br>
            FlowRL: Matching Reward Distributions for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15207v1">http://arxiv.org/abs/2509.15207v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:44:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1564aed7/2027967f.mp3" length="19030176" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1186</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin</p>

            <p><strong>Title:</strong><br>
            FlowRL: Matching Reward Distributions for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15207v1">http://arxiv.org/abs/2509.15207v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation</title>
      <itunes:episode>1161</itunes:episode>
      <podcast:episode>1161</podcast:episode>
      <itunes:title>Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">84c59bb1-29bd-4f34-9bc9-f62373271bf6</guid>
      <link>https://share.transistor.fm/s/56515e3d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14760v1">http://arxiv.org/abs/2509.14760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14760v1">http://arxiv.org/abs/2509.14760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:44:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/56515e3d/5bf4e9e5.mp3" length="20498915" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1277</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14760v1">http://arxiv.org/abs/2509.14760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation</title>
      <itunes:episode>1160</itunes:episode>
      <podcast:episode>1160</podcast:episode>
      <itunes:title>Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b4d28b07-47c4-417e-8935-181dd67ffc7b</guid>
      <link>https://share.transistor.fm/s/a4fea617</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15194v1">http://arxiv.org/abs/2509.15194v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.</p>
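            <p><strong>Illustrative code sketch (not from the paper):</strong><br>
            A toy sketch of the majority-for-selection + novelty-for-variation reward described above: the majority-voted answer provides the stable anchor, and a novelty term rewards responses whose embedding differs from the other sampled responses. The scoring, weights, and embeddings are assumptions for illustration only.</p>
            <pre><code>from collections import Counter
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def evol_style_rewards(answers, embeddings, novelty_weight=0.5):
    majority = Counter(answers).most_common(1)[0][0]
    rewards = []
    for i, (ans, emb) in enumerate(zip(answers, embeddings)):
        anchor = 1.0 if ans == majority else 0.0                # selection term
        others = [e for j, e in enumerate(embeddings) if j != i]
        novelty = 1.0 - max(cosine(emb, o) for o in others)     # variation term
        rewards.append(anchor + novelty_weight * novelty)
    return rewards

# Toy usage: three sampled responses, two agreeing on the answer "42".
print(evol_style_rewards(["42", "42", "7"], [[1, 0], [0.9, 0.1], [0, 1]]))
</code></pre>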
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15194v1">http://arxiv.org/abs/2509.15194v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:44:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a4fea617/bb91ef81.mp3" length="21672971" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15194v1">http://arxiv.org/abs/2509.15194v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning</title>
      <itunes:episode>1159</itunes:episode>
      <podcast:episode>1159</podcast:episode>
      <itunes:title>FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">63d5db0c-5cb0-48a1-b8da-b748d1e70601</guid>
      <link>https://share.transistor.fm/s/69c0fb37</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang</p>

            <p><strong>Title:</strong><br>
            FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13160v1">http://arxiv.org/abs/2509.13160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang</p>

            <p><strong>Title:</strong><br>
            FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13160v1">http://arxiv.org/abs/2509.13160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:43:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69c0fb37/895b9a00.mp3" length="25101071" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1565</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang</p>

            <p><strong>Title:</strong><br>
            FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13160v1">http://arxiv.org/abs/2509.13160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impacts performance significantly. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation</title>
      <itunes:episode>1158</itunes:episode>
      <podcast:episode>1158</podcast:episode>
      <itunes:title>Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ad57b087-6787-4584-9ff0-719d42d3feb7</guid>
      <link>https://share.transistor.fm/s/52e8579e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou</p>

            <p><strong>Title:</strong><br>
            Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15185v1">http://arxiv.org/abs/2509.15185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou</p>

            <p><strong>Title:</strong><br>
            Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15185v1">http://arxiv.org/abs/2509.15185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 19 Sep 2025 20:43:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52e8579e/1cc206c2.mp3" length="20424938" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1273</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou</p>

            <p><strong>Title:</strong><br>
            Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.15185v1">http://arxiv.org/abs/2509.15185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hala Technical Report: Building Arabic-Centric Instruction &amp; Translation Models at Scale</title>
      <itunes:episode>1157</itunes:episode>
      <podcast:episode>1157</podcast:episode>
      <itunes:title>Hala Technical Report: Building Arabic-Centric Instruction &amp; Translation Models at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7434c00d-db1a-4e6b-8c2d-d5d63cb1d244</guid>
      <link>https://share.transistor.fm/s/daa99710</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Hala Technical Report: Building Arabic-Centric Instruction &amp; Translation Models at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14008v1">http://arxiv.org/abs/2509.14008v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.</p>
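            <p><strong>Illustrative code sketch (not from the released recipe):</strong><br>
            A minimal sketch of slerp merging as mentioned above: spherical linear interpolation between a base model's weights and a fine-tuned model's weights, shown here on flat weight vectors. The interpolation factor and fallback behavior are assumptions.</p>
            <pre><code>import numpy as np

def slerp(w_base, w_tuned, t=0.5, eps=1e-8):
    """Spherically interpolate between two flat weight vectors."""
    a, b = np.asarray(w_base, float), np.asarray(w_tuned, float)
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))  # angle between weight directions
    if omega < eps:                                   # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Toy usage: merge two small weight vectors halfway.
print(slerp([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], t=0.5))
</code></pre>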
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Hala Technical Report: Building Arabic-Centric Instruction &amp; Translation Models at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14008v1">http://arxiv.org/abs/2509.14008v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Sep 2025 20:10:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/daa99710/fac0c215.mp3" length="20801520" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1296</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Hala Technical Report: Building Arabic-Centric Instruction &amp; Translation Models at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14008v1">http://arxiv.org/abs/2509.14008v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAIL-VL2 Technical Report</title>
      <itunes:episode>1156</itunes:episode>
      <podcast:episode>1156</podcast:episode>
      <itunes:title>SAIL-VL2 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f35cc25-73d9-437b-94b1-f8b52e97f77b</guid>
      <link>https://share.transistor.fm/s/a3707599</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng</p>

            <p><strong>Title:</strong><br>
            SAIL-VL2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14033v1">http://arxiv.org/abs/2509.14033v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng</p>

            <p><strong>Title:</strong><br>
            SAIL-VL2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14033v1">http://arxiv.org/abs/2509.14033v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Sep 2025 20:10:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a3707599/6e4c7c62.mp3" length="23585901" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1470</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng</p>

            <p><strong>Title:</strong><br>
            SAIL-VL2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.14033v1">http://arxiv.org/abs/2509.14033v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era</title>
      <itunes:episode>1155</itunes:episode>
      <podcast:episode>1155</podcast:episode>
      <itunes:title>PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85ff7f2b-c893-4cc1-93db-3010ac5fd663</guid>
      <link>https://share.transistor.fm/s/8ecb1a70</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu</p>

            <p><strong>Title:</strong><br>
            PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12989v1">http://arxiv.org/abs/2509.12989v1</a></p>

            <p><strong>Abstract:</strong><br>
            Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu</p>

            <p><strong>Title:</strong><br>
            PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12989v1">http://arxiv.org/abs/2509.12989v1</a></p>

            <p><strong>Abstract:</strong><br>
            Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 18 Sep 2025 20:09:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8ecb1a70/fdc16e0e.mp3" length="19517110" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1216</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu</p>

            <p><strong>Title:</strong><br>
            PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12989v1">http://arxiv.org/abs/2509.12989v1</a></p>

            <p><strong>Abstract:</strong><br>
            Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research</title>
      <itunes:episode>1154</itunes:episode>
      <podcast:episode>1154</podcast:episode>
      <itunes:title>WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8b5409a7-b720-449f-9b26-292d7b39f758</guid>
      <link>https://share.transistor.fm/s/e6ecb0df</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13312v1">http://arxiv.org/abs/2509.13312v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13312v1">http://arxiv.org/abs/2509.13312v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:52:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e6ecb0df/c5db4b56.mp3" length="19169393" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1194</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13312v1">http://arxiv.org/abs/2509.13312v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Agents via Continual Pre-training</title>
      <itunes:episode>1153</itunes:episode>
      <podcast:episode>1153</podcast:episode>
      <itunes:title>Scaling Agents via Continual Pre-training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">91a19933-9870-4f6f-9724-511bcd3fa65f</guid>
      <link>https://share.transistor.fm/s/2f773c4e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Scaling Agents via Continual Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13310v1">http://arxiv.org/abs/2509.13310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Scaling Agents via Continual Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13310v1">http://arxiv.org/abs/2509.13310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:51:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2f773c4e/125e0dd1.mp3" length="22832336" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1423</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Scaling Agents via Continual Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13310v1">http://arxiv.org/abs/2509.13310v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning</title>
      <itunes:episode>1152</itunes:episode>
      <podcast:episode>1152</podcast:episode>
      <itunes:title>WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9239f9f8-c5cd-4c2e-9da5-1d9ae5d176dd</guid>
      <link>https://share.transistor.fm/s/f9da3ec2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13305v1">http://arxiv.org/abs/2509.13305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13305v1">http://arxiv.org/abs/2509.13305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:51:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f9da3ec2/bd377725.mp3" length="21259624" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13305v1">http://arxiv.org/abs/2509.13305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards General Agentic Intelligence via Environment Scaling</title>
      <itunes:episode>1151</itunes:episode>
      <podcast:episode>1151</podcast:episode>
      <itunes:title>Towards General Agentic Intelligence via Environment Scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">48aa683b-7ba2-4a77-9bbf-4a5f13daa86b</guid>
      <link>https://share.transistor.fm/s/91c69ccb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Towards General Agentic Intelligence via Environment Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13311v1">http://arxiv.org/abs/2509.13311v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Towards General Agentic Intelligence via Environment Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13311v1">http://arxiv.org/abs/2509.13311v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:51:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/91c69ccb/75951c2a.mp3" length="22210850" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Towards General Agentic Intelligence via Environment Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13311v1">http://arxiv.org/abs/2509.13311v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents</title>
      <itunes:episode>1150</itunes:episode>
      <podcast:episode>1150</podcast:episode>
      <itunes:title>WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3b14214-db55-400c-8cec-4b0c26bf1135</guid>
      <link>https://share.transistor.fm/s/a956a10f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13309v1">http://arxiv.org/abs/2509.13309v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13309v1">http://arxiv.org/abs/2509.13309v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:50:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a956a10f/ff51d41e.mp3" length="19848146" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1237</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13309v1">http://arxiv.org/abs/2509.13309v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization</title>
      <itunes:episode>1149</itunes:episode>
      <podcast:episode>1149</podcast:episode>
      <itunes:title>ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">67d468d8-3f0c-48b3-9e32-797431ed7e1f</guid>
      <link>https://share.transistor.fm/s/4c01daf5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13313v1">http://arxiv.org/abs/2509.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13313v1">http://arxiv.org/abs/2509.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:50:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4c01daf5/a03331be.mp3" length="18637314" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1161</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13313v1">http://arxiv.org/abs/2509.13313v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Single-stream Policy Optimization</title>
      <itunes:episode>1148</itunes:episode>
      <podcast:episode>1148</podcast:episode>
      <itunes:title>Single-stream Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5951261e-c1a0-43c2-a994-7cba1f175456</guid>
      <link>https://share.transistor.fm/s/da9895d2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zhongwen Xu, Zihan Ding</p>

            <p><strong>Title:</strong><br>
            Single-stream Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13232v1">http://arxiv.org/abs/2509.13232v1</a></p>

            <p><strong>Abstract:</strong><br>
            We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zhongwen Xu, Zihan Ding</p>

            <p><strong>Title:</strong><br>
            Single-stream Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13232v1">http://arxiv.org/abs/2509.13232v1</a></p>

            <p><strong>Abstract:</strong><br>
            We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.</p>
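
            <p><strong>Illustrative sketch:</strong><br>
            For intuition, a small self-contained sketch of a persistent per-prompt value tracker combined with batch-global advantage normalization, in the spirit of the abstract. The exponential-moving-average update with a fixed beta is a simplification; the paper's KL-adaptive update and exact hyperparameters may differ.</p>

<pre><code>from collections import defaultdict

class ValueTracker:
    """Persistent running baseline per prompt (simplified EMA; the paper's tracker is KL-adaptive)."""
    def __init__(self, beta=0.9):
        self.beta = beta
        self.values = defaultdict(float)

    def baseline(self, prompt_id):
        return self.values[prompt_id]

    def update(self, prompt_id, reward):
        v = self.values[prompt_id]
        self.values[prompt_id] = self.beta * v + (1.0 - self.beta) * reward

def global_advantages(samples, tracker):
    """samples: list of (prompt_id, reward), one rollout per prompt (no groups)."""
    raw = [reward - tracker.baseline(pid) for pid, reward in samples]
    mean = sum(raw) / len(raw)
    std = max((sum((a - mean) ** 2 for a in raw) / len(raw)) ** 0.5, 1e-8)
    for pid, reward in samples:
        tracker.update(pid, reward)
    return [(a - mean) / std for a in raw]  # normalized globally across the batch
</code></pre>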
            ]]>
      </content:encoded>
      <pubDate>Wed, 17 Sep 2025 20:50:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/da9895d2/a71055ec.mp3" length="20939809" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1305</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Zhongwen Xu, Zihan Ding</p>

            <p><strong>Title:</strong><br>
            Single-stream Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.13232v1">http://arxiv.org/abs/2509.13232v1</a></p>

            <p><strong>Abstract:</strong><br>
            We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</title>
      <itunes:episode>1147</itunes:episode>
      <podcast:episode>1147</podcast:episode>
      <itunes:title>OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1e90e816-f327-49be-a85e-943bfdd15e4b</guid>
      <link>https://share.transistor.fm/s/4f75280a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He</p>

            <p><strong>Title:</strong><br>
            OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12201v1">http://arxiv.org/abs/2509.12201v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He</p>

            <p><strong>Title:</strong><br>
            OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12201v1">http://arxiv.org/abs/2509.12201v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Sep 2025 20:07:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4f75280a/f5fd3759.mp3" length="21692592" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 75 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He</p>

            <p><strong>Title:</strong><br>
            OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.12201v1">http://arxiv.org/abs/2509.12201v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning</title>
      <itunes:episode>1146</itunes:episode>
      <podcast:episode>1146</podcast:episode>
      <itunes:title>UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d3da2734-f69e-4636-9deb-1cd69313c15a</guid>
      <link>https://share.transistor.fm/s/69123ea7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.11543v1">http://arxiv.org/abs/2509.11543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.11543v1">http://arxiv.org/abs/2509.11543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.</p>
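
            <p><strong>Illustrative sketch:</strong><br>
            A hedged toy illustration of the reward shaping mentioned above: discounted future returns blended with an episode-level outcome. The baselines and mixing weights are made-up assumptions, and the sketch does not reproduce the paper's Patch Module or exact objective.</p>

<pre><code>def discounted_returns(step_rewards, gamma=0.99):
    """Backward pass turning per-step rewards into discounted future returns."""
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def mixed_advantages(step_rewards, episode_reward, step_baseline=0.0,
                     episode_baseline=0.0, w_step=0.5, w_episode=0.5):
    """Blend each step's discounted return with the episode-level outcome (illustrative weights)."""
    step_adv = [g - step_baseline for g in discounted_returns(step_rewards)]
    episode_adv = episode_reward - episode_baseline
    return [w_step * a + w_episode * episode_adv for a in step_adv]

# Example: three GUI actions with a sparse success reward at the end of the episode.
print(mixed_advantages([0.0, 0.0, 1.0], episode_reward=1.0))
</code></pre>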
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Sep 2025 20:06:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69123ea7/e79a84de.mp3" length="19604467" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1222</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.11543v1">http://arxiv.org/abs/2509.11543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address this dilemma, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that our Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts</title>
      <itunes:episode>1145</itunes:episode>
      <podcast:episode>1145</podcast:episode>
      <itunes:title>InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2e208b7d-9704-4023-9581-6633322eb7de</guid>
      <link>https://share.transistor.fm/s/a0296d0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.10813v1">http://arxiv.org/abs/2509.10813v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce <strong>InternScenes</strong>, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes built by integrating three disparate scene sources: real-world scans, procedurally generated scenes, and designer-created scenes. It includes 1.96M 3D objects and covers 15 common scene types and 288 object classes. In particular, we preserve the many small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions through physical simulation. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up model training for both tasks, making generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.10813v1">http://arxiv.org/abs/2509.10813v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce <strong>InternScenes</strong>, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes built by integrating three disparate scene sources: real-world scans, procedurally generated scenes, and designer-created scenes. It includes 1.96M 3D objects and covers 15 common scene types and 288 object classes. In particular, we preserve the many small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions through physical simulation. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up model training for both tasks, making generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 16 Sep 2025 20:06:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a0296d0d/dcd3917e.mp3" length="20409887" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.10813v1">http://arxiv.org/abs/2509.10813v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce <strong>InternScenes</strong>, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes built by integrating three disparate scene sources: real-world scans, procedurally generated scenes, and designer-created scenes. It includes 1.96M 3D objects and covers 15 common scene types and 288 object classes. In particular, we preserve the many small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions through physical simulation. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up model training for both tasks, making generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IntrEx: A Dataset for Modeling Engagement in Educational Conversations</title>
      <itunes:episode>1144</itunes:episode>
      <podcast:episode>1144</podcast:episode>
      <itunes:title>IntrEx: A Dataset for Modeling Engagement in Educational Conversations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8a044e29-cd8d-415d-a833-d92dc82bcd86</guid>
      <link>https://share.transistor.fm/s/3ebde82a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingwei Tan, Mahathi Parvatham, Chiara Gambi, Gabriele Pergola</p>

            <p><strong>Title:</strong><br>
            IntrEx: A Dataset for Modeling Engagement in Educational Conversations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06652v1">http://arxiv.org/abs/2509.06652v1</a></p>

            <p><strong>Abstract:</strong><br>
            Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingwei Tan, Mahathi Parvatham, Chiara Gambi, Gabriele Pergola</p>

            <p><strong>Title:</strong><br>
            IntrEx: A Dataset for Modeling Engagement in Educational Conversations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06652v1">http://arxiv.org/abs/2509.06652v1</a></p>

            <p><strong>Abstract:</strong><br>
            Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Sep 2025 19:57:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3ebde82a/94b372d2.mp3" length="23285015" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingwei Tan, Mahathi Parvatham, Chiara Gambi, Gabriele Pergola</p>

            <p><strong>Title:</strong><br>
            IntrEx: A Dataset for Modeling Engagement in Educational Conversations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06652v1">http://arxiv.org/abs/2509.06652v1</a></p>

            <p><strong>Abstract:</strong><br>
            Engagement and motivation are crucial for second-language acquisition, yet maintaining learner interest in educational conversations remains a challenge. While prior research has explored what makes educational texts interesting, little is known about the linguistic features that drive engagement in conversations. To address this gap, we introduce IntrEx, the first large dataset annotated for interestingness and expected interestingness in teacher-student interactions. Built upon the Teacher-Student Chatroom Corpus (TSCC), IntrEx extends prior work by incorporating sequence-level annotations, allowing for the study of engagement beyond isolated turns to capture how interest evolves over extended dialogues. We employ a rigorous annotation process with over 100 second-language learners, using a comparison-based rating approach inspired by reinforcement learning from human feedback (RLHF) to improve agreement. We investigate whether large language models (LLMs) can predict human interestingness judgments. We find that LLMs (7B/8B parameters) fine-tuned on interestingness ratings outperform larger proprietary models like GPT-4o, demonstrating the potential for specialised datasets to model engagement in educational settings. Finally, we analyze how linguistic and cognitive factors, such as concreteness, comprehensibility (readability), and uptake, influence engagement in educational dialogues.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs</title>
      <itunes:episode>1143</itunes:episode>
      <podcast:episode>1143</podcast:episode>
      <itunes:title>The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d8b1b0b0-3e14-4a35-980b-5056473b542d</guid>
      <link>https://share.transistor.fm/s/751f7a5c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09677v1">http://arxiv.org/abs/2509.09677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, where models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced simply by scaling up model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09677v1">http://arxiv.org/abs/2509.09677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, where models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced simply by scaling up model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.</p>
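
            <p><strong>Illustrative sketch:</strong><br>
            The compounding argument in the abstract can be made concrete with a little arithmetic. The sketch below assumes independent per-step success and uses a 50% task-success threshold to define "horizon length"; both are illustrative simplifications rather than the paper's metric.</p>

<pre><code>import math

def task_success(per_step_accuracy, num_steps):
    """Probability of completing num_steps in a row, assuming independent steps."""
    return per_step_accuracy ** num_steps

def horizon_at(per_step_accuracy, target_success=0.5):
    """Number of steps at which task success falls to the target level."""
    return math.log(target_success) / math.log(per_step_accuracy)

for p in (0.99, 0.995, 0.999):
    print(f"per-step accuracy {p:.3f} -> ~{horizon_at(p):.0f} steps at 50% task success")
# 0.990 -> ~69 steps, 0.995 -> ~138 steps, 0.999 -> ~693 steps:
# small single-step gains compound into much longer achievable horizons.
</code></pre>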
            ]]>
      </content:encoded>
      <pubDate>Mon, 15 Sep 2025 19:57:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/751f7a5c/70082372.mp3" length="23154619" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1443</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09677v1">http://arxiv.org/abs/2509.09677v1</a></p>

            <p><strong>Abstract:</strong><br>
            Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, where models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced simply by scaling up model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model</title>
      <itunes:episode>1142</itunes:episode>
      <podcast:episode>1142</podcast:episode>
      <itunes:title>VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9aa3674f-d7e5-4ea2-b289-cf0724a20372</guid>
      <link>https://share.transistor.fm/s/9c09d30f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang</p>

            <p><strong>Title:</strong><br>
            VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09372v1">http://arxiv.org/abs/2509.09372v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang</p>

            <p><strong>Title:</strong><br>
            VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09372v1">http://arxiv.org/abs/2509.09372v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.</p>
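
            <p><strong>Illustrative sketch:</strong><br>
            As a hedged, generic illustration of a lightweight policy head that attends over vision-language features to produce actions, the PyTorch sketch below uses invented dimensions, a plain cross-attention layer, and the made-up name TinyPolicyHead; it is not the paper's Bridge Attention design or actual architecture.</p>

<pre><code>import torch
import torch.nn as nn

class TinyPolicyHead(nn.Module):
    """Hypothetical lightweight policy head: learned action queries cross-attend to VL features."""
    def __init__(self, vl_dim=896, act_dim=7, hidden=256, num_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden))
        self.proj = nn.Linear(vl_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                  nn.Linear(hidden, act_dim))

    def forward(self, vl_features):               # vl_features: (batch, tokens, vl_dim)
        kv = self.proj(vl_features)
        q = self.queries.unsqueeze(0).expand(vl_features.size(0), -1, -1)
        attended, _ = self.attn(q, kv, kv)         # queries gather the relevant VL conditions
        return self.head(attended.mean(dim=1))     # continuous action vector

actions = TinyPolicyHead()(torch.randn(2, 64, 896))   # -> shape (2, 7)
</code></pre>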
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:07:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9c09d30f/e0df72c6.mp3" length="20742159" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1293</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 114 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang</p>

            <p><strong>Title:</strong><br>
            VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09372v1">http://arxiv.org/abs/2509.09372v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning</title>
      <itunes:episode>1141</itunes:episode>
      <podcast:episode>1141</podcast:episode>
      <itunes:title>HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">24c84d64-07e8-484d-8e3d-3f86b6fa4d85</guid>
      <link>https://share.transistor.fm/s/16ecc176</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08519v1">http://arxiv.org/abs/2509.08519v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of jointly handling the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllability across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08519v1">http://arxiv.org/abs/2509.08519v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync given multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt-following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. To jointly learn controllability across multimodal inputs, we progressively incorporate the audio-visual sync task, building on previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods on these sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:06:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16ecc176/ef80b91a.mp3" length="23187222" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 88 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08519v1">http://arxiv.org/abs/2509.08519v1</a></p>

            <p><strong>Abstract:</strong><br>
            Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync given multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt-following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. To jointly learn controllability across multimodal inputs, we progressively incorporate the audio-visual sync task, building on previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods on these sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning</title>
      <itunes:episode>1140</itunes:episode>
      <podcast:episode>1140</podcast:episode>
      <itunes:title>SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56f7d2c4-ce4e-4e4b-8a58-b64acc8b5766</guid>
      <link>https://share.transistor.fm/s/3a5d92ed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.RO, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09674v1">http://arxiv.org/abs/2509.09674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π_0 on RoboTwin 1.0 &amp; 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns not seen at any earlier stage of training. Github: https://github.com/PRIME-RL/SimpleVLA-RL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.RO, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09674v1">http://arxiv.org/abs/2509.09674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π_0 on RoboTwin 1.0 &amp; 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns not seen at any earlier stage of training. Github: https://github.com/PRIME-RL/SimpleVLA-RL</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:06:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3a5d92ed/135fd081.mp3" length="24031481" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1498</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.RO, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09674v1">http://arxiv.org/abs/2509.09674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms π_0 on RoboTwin 1.0 &amp; 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon, "pushcut", during RL training, wherein the policy discovers patterns not seen at any earlier stage of training. Github: https://github.com/PRIME-RL/SimpleVLA-RL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs</title>
      <itunes:episode>1139</itunes:episode>
      <podcast:episode>1139</podcast:episode>
      <itunes:title>EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc680c1e-1dfc-42be-90e8-2fd64d5d41ad</guid>
      <link>https://share.transistor.fm/s/89ddbe34</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09174v1">http://arxiv.org/abs/2509.09174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09174v1">http://arxiv.org/abs/2509.09174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:06:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/89ddbe34/75cd228d.mp3" length="18640256" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1161</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09174v1">http://arxiv.org/abs/2509.09174v1</a></p>

            <p><strong>Abstract:</strong><br>
            Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents</title>
      <itunes:episode>1138</itunes:episode>
      <podcast:episode>1138</podcast:episode>
      <itunes:title>Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8c67bcc5-d545-4d8b-a98c-cdc8c73c7108</guid>
      <link>https://share.transistor.fm/s/5b32a904</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09265v1">http://arxiv.org/abs/2509.09265v1</a></p>

            <p><strong>Abstract:</strong><br>
            In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09265v1">http://arxiv.org/abs/2509.09265v1</a></p>

            <p><strong>Abstract:</strong><br>
            In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:05:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5b32a904/1dfd45f7.mp3" length="20052953" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1250</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09265v1">http://arxiv.org/abs/2509.09265v1</a></p>

            <p><strong>Abstract:</strong><br>
            In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis</title>
      <itunes:episode>1137</itunes:episode>
      <podcast:episode>1137</podcast:episode>
      <itunes:title>Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a98d83aa-af9b-4037-958a-cfa8e40e8b97</guid>
      <link>https://share.transistor.fm/s/fe423e8f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan</p>

            <p><strong>Title:</strong><br>
            Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09595v1">http://arxiv.org/abs/2509.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan</p>

            <p><strong>Title:</strong><br>
            Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09595v1">http://arxiv.org/abs/2509.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:05:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fe423e8f/a5d6b347.mp3" length="24055763" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1500</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan</p>

            <p><strong>Title:</strong><br>
            Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09595v1">http://arxiv.org/abs/2509.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FLUX-Reason-6M &amp; PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark</title>
      <itunes:episode>1136</itunes:episode>
      <podcast:episode>1136</podcast:episode>
      <itunes:title>FLUX-Reason-6M &amp; PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1e0ab96a-f7ca-423d-bb4a-fe7deb59b53d</guid>
      <link>https://share.transistor.fm/s/639739db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            FLUX-Reason-6M &amp; PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09680v1">http://arxiv.org/abs/2509.09680v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            FLUX-Reason-6M &amp; PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09680v1">http://arxiv.org/abs/2509.09680v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:05:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/639739db/a1a966b8.mp3" length="25903982" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1615</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            FLUX-Reason-6M &amp; PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09680v1">http://arxiv.org/abs/2509.09680v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation took 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can Understanding and Generation Truly Benefit Together -- or Just Coexist?</title>
      <itunes:episode>1135</itunes:episode>
      <podcast:episode>1135</podcast:episode>
      <itunes:title>Can Understanding and Generation Truly Benefit Together -- or Just Coexist?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ab2fce1b-41cd-43c4-a105-65fc6bff2ae2</guid>
      <link>https://share.transistor.fm/s/86508fba</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Can Understanding and Generation Truly Benefit Together -- or Just Coexist?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09666v1">http://arxiv.org/abs/2509.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Can Understanding and Generation Truly Benefit Together -- or Just Coexist?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09666v1">http://arxiv.org/abs/2509.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:04:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/86508fba/218ab934.mp3" length="23489402" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            Can Understanding and Generation Truly Benefit Together -- or Just Coexist?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.09666v1">http://arxiv.org/abs/2509.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining</title>
      <itunes:episode>1134</itunes:episode>
      <podcast:episode>1134</podcast:episode>
      <itunes:title>MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81ba543c-bd33-4915-93b4-e03b42c3daf9</guid>
      <link>https://share.transistor.fm/s/15d742cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke</p>

            <p><strong>Title:</strong><br>
            MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06806v3">http://arxiv.org/abs/2509.06806v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows.   Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference.   Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke</p>

            <p><strong>Title:</strong><br>
            MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06806v3">http://arxiv.org/abs/2509.06806v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows.   Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference.   Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 12 Sep 2025 21:04:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/15d742cb/75b628fe.mp3" length="22116831" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke</p>

            <p><strong>Title:</strong><br>
            MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06806v3">http://arxiv.org/abs/2509.06806v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows.   Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference.   Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Reinforcement Learning for Large Reasoning Models</title>
      <itunes:episode>1133</itunes:episode>
      <podcast:episode>1133</podcast:episode>
      <itunes:title>A Survey of Reinforcement Learning for Large Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eca3a20d-e7c8-4a01-9278-12789aefe58e</guid>
      <link>https://share.transistor.fm/s/4b894a67</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Reinforcement Learning for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08827v1">http://arxiv.org/abs/2509.08827v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Reinforcement Learning for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08827v1">http://arxiv.org/abs/2509.08827v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Sep 2025 20:19:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4b894a67/8742be79.mp3" length="20092634" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1252</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Reinforcement Learning for Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08827v1">http://arxiv.org/abs/2509.08827v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RewardDance: Reward Scaling in Visual Generation</title>
      <itunes:episode>1132</itunes:episode>
      <podcast:episode>1132</podcast:episode>
      <itunes:title>RewardDance: Reward Scaling in Visual Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ac5644f5-7d2f-4c2a-b8d2-1b29bd35b62e</guid>
      <link>https://share.transistor.fm/s/748b954b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang</p>

            <p><strong>Title:</strong><br>
            RewardDance: Reward Scaling in Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08826v1">http://arxiv.org/abs/2509.08826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.</p>
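
            <p><strong>Illustrative sketch:</strong><br>
            A minimal PyTorch sketch (not the paper's implementation) of the generative reward idea: score a candidate image by the probability mass the reward VLM assigns to a "yes" token at the judgment position. The vocabulary size, the "yes" token id, and the random logits standing in for a real VLM forward pass are assumptions.</p>

            <pre><code>import torch

def yes_token_reward(logits, yes_id):
    # logits: [batch, vocab] next-token logits at the judgment position
    probs = torch.softmax(logits, dim=-1)
    return probs[:, yes_id]            # reward in [0, 1] per candidate image

torch.manual_seed(0)
vocab_size, yes_id = 32000, 9891       # hypothetical vocabulary size and "yes" token id
logits = torch.randn(4, vocab_size)    # stand-in for VLM outputs on 4 generated images
print(yes_token_reward(logits, yes_id))
</code></pre>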
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang</p>

            <p><strong>Title:</strong><br>
            RewardDance: Reward Scaling in Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08826v1">http://arxiv.org/abs/2509.08826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Sep 2025 20:19:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/748b954b/5c4418ab.mp3" length="20654358" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang</p>

            <p><strong>Title:</strong><br>
            RewardDance: Reward Scaling in Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08826v1">http://arxiv.org/abs/2509.08826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>3D and 4D World Modeling: A Survey</title>
      <itunes:episode>1131</itunes:episode>
      <podcast:episode>1131</podcast:episode>
      <itunes:title>3D and 4D World Modeling: A Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cbe5f89c-4ba6-4a94-a475-8a08b50dbb09</guid>
      <link>https://share.transistor.fm/s/c877f45a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            3D and 4D World Modeling: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07996v2">http://arxiv.org/abs/2509.07996v2</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            3D and 4D World Modeling: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07996v2">http://arxiv.org/abs/2509.07996v2</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Sep 2025 20:18:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c877f45a/cfbcc0bc.mp3" length="19655840" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1225</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            3D and 4D World Modeling: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07996v2">http://arxiv.org/abs/2509.07996v2</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning</title>
      <itunes:episode>1130</itunes:episode>
      <podcast:episode>1130</podcast:episode>
      <itunes:title>AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f6dc75ce-67a0-483f-ab00-e204bebce382</guid>
      <link>https://share.transistor.fm/s/5f3c7d45</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08755v1">http://arxiv.org/abs/2509.08755v1</a></p>

            <p><strong>Abstract:</strong><br>
            Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.</p>
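
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch (not the released framework) of a ScalingInter-RL-style schedule in which the allowed number of agent-environment interaction turns grows in stages over training. The stage boundaries and turn budgets are assumed values, not those used in the paper.</p>

            <pre><code>def interaction_budget(step, stages=((0, 4), (2000, 8), (6000, 16))):
    # Return the maximum number of agent-environment turns allowed at `step`.
    budget = stages[0][1]
    for start, turns in stages:
        if step >= start:
            budget = turns
    return budget

for step in (0, 2500, 7000):
    budget = interaction_budget(step)
    rollout_turns = min(20, budget)   # a hypothetical 20-turn task, truncated to the budget
    print(f"step {step}: budget={budget} turns, rollout truncated at {rollout_turns}")
</code></pre>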
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08755v1">http://arxiv.org/abs/2509.08755v1</a></p>

            <p><strong>Abstract:</strong><br>
            Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 11 Sep 2025 20:18:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f3c7d45/d0aadaf9.mp3" length="23799978" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1484</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.08755v1">http://arxiv.org/abs/2509.08755v1</a></p>

            <p><strong>Abstract:</strong><br>
            Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Parallel-R1: Towards Parallel Thinking via Reinforcement Learning</title>
      <itunes:episode>1129</itunes:episode>
      <podcast:episode>1129</podcast:episode>
      <itunes:title>Parallel-R1: Towards Parallel Thinking via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7f63b322-0549-4b2a-8278-d2550aec2a99</guid>
      <link>https://share.transistor.fm/s/cb288173</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Parallel-R1: Towards Parallel Thinking via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07980v1">http://arxiv.org/abs/2509.07980v1</a></p>

            <p><strong>Abstract:</strong><br>
            Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose <strong>Parallel-R1</strong>, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a <strong>mid-training exploration scaffold</strong>, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.</p>
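
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch (not the authors' code) of the progressive curriculum described above: supervised fine-tuning on prompt-generated parallel-thinking traces from easier tasks first, then reinforcement learning on harder problems. The stage boundary and data names are hypothetical.</p>

            <pre><code>SFT_STEPS = 1000   # hypothetical stage boundary; not a value from the paper

def training_stage(step):
    # Stage 1: SFT on parallel-thinking traces from easy tasks (cold start).
    # Stage 2: RL on harder problems to explore and generalize the skill.
    return "sft_easy_parallel_traces" if step < SFT_STEPS else "rl_hard_problems"

for step in (0, 500, 1500, 5000):
    print(step, training_stage(step))
</code></pre>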
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Parallel-R1: Towards Parallel Thinking via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07980v1">http://arxiv.org/abs/2509.07980v1</a></p>

            <p><strong>Abstract:</strong><br>
            Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose <strong>Parallel-R1</strong>, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a <strong>mid-training exploration scaffold</strong>, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Sep 2025 20:37:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cb288173/382e8695.mp3" length="22469154" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1401</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Parallel-R1: Towards Parallel Thinking via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07980v1">http://arxiv.org/abs/2509.07980v1</a></p>

            <p><strong>Abstract:</strong><br>
            Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. In contrast, we propose <strong>Parallel-R1</strong>, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a <strong>mid-training exploration scaffold</strong>, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-sourced at https://github.com/zhengkid/Parallel-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Representation Alignment for Multimodal Large Language Models</title>
      <itunes:episode>1128</itunes:episode>
      <podcast:episode>1128</podcast:episode>
      <itunes:title>Visual Representation Alignment for Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03b07b8a-1a45-4bfd-9595-b14c428ece60</guid>
      <link>https://share.transistor.fm/s/d1eecade</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Visual Representation Alignment for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07979v1">http://arxiv.org/abs/2509.07979v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.</p>
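
            <p><strong>Illustrative sketch:</strong><br>
            A minimal PyTorch sketch (not the released code) of a visual representation alignment regularizer: the MLLM's hidden states at visual-token positions are projected and pulled toward frozen vision-foundation-model (VFM) features with a cosine-similarity loss. The tensor sizes and the linear projection are toy assumptions.</p>

            <pre><code>import torch
import torch.nn.functional as F

def alignment_loss(mllm_visual_hidden, vfm_feats, proj):
    # mllm_visual_hidden: [B, N, d_llm] hidden states at visual-token positions
    # vfm_feats:          [B, N, d_vfm] frozen vision-foundation-model features
    pred = proj(mllm_visual_hidden)                      # project into the VFM space
    return 1.0 - F.cosine_similarity(pred, vfm_feats, dim=-1).mean()

torch.manual_seed(0)
B, N, d_llm, d_vfm = 2, 16, 64, 32                       # toy sizes
proj = torch.nn.Linear(d_llm, d_vfm)
loss = alignment_loss(torch.randn(B, N, d_llm), torch.randn(B, N, d_vfm), proj)
print(float(loss))                                       # added to the text loss with a weight
</code></pre>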
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Visual Representation Alignment for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07979v1">http://arxiv.org/abs/2509.07979v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Sep 2025 20:37:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1eecade/b7f80939.mp3" length="25226434" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1573</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Visual Representation Alignment for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07979v1">http://arxiv.org/abs/2509.07979v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search</title>
      <itunes:episode>1127</itunes:episode>
      <podcast:episode>1127</podcast:episode>
      <itunes:title>Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55e7e3a3-b14e-49bc-b3e2-6c0386639e09</guid>
      <link>https://share.transistor.fm/s/65d53e59</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07969v1">http://arxiv.org/abs/2509.07969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.</p>
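
            <p><strong>Illustrative sketch:</strong><br>
            A minimal PyTorch sketch (not the paper's implementation) of over-turn masking: rollouts that hit the maximum number of turns are excluded from a toy REINFORCE-style policy loss rather than scored as failures. The rewards and log-probabilities are made-up values.</p>

            <pre><code>import torch

def masked_policy_loss(logp, reward, hit_turn_cap):
    # logp, reward, hit_turn_cap: [batch] tensors for a toy REINFORCE-style update
    mask = (~hit_turn_cap).float()                 # drop rollouts that ran out of turns
    per_sample = -(reward * logp) * mask
    return per_sample.sum() / mask.sum().clamp(min=1.0)

logp = torch.tensor([-1.2, -0.7, -0.9])
reward = torch.tensor([1.0, 0.0, 0.0])             # the last rollout hit the turn cap
hit_cap = torch.tensor([False, False, True])
print(float(masked_policy_loss(logp, reward, hit_cap)))
</code></pre>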
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07969v1">http://arxiv.org/abs/2509.07969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Sep 2025 20:36:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65d53e59/fd31ac3c.mp3" length="21043926" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07969v1">http://arxiv.org/abs/2509.07969v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reconstruction Alignment Improves Unified Multimodal Models</title>
      <itunes:episode>1126</itunes:episode>
      <podcast:episode>1126</podcast:episode>
      <itunes:title>Reconstruction Alignment Improves Unified Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be95d60f-4e91-4add-905e-9fb654c635ca</guid>
      <link>https://share.transistor.fm/s/7fda8254</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction Alignment Improves Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07295v1">http://arxiv.org/abs/2509.07295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), while also boosting editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.</p>
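
            <p><strong>Illustrative sketch:</strong><br>
            A minimal PyTorch sketch (toy stand-ins, not the released RecA code) of the core idea: condition the generation path on the model's own understanding embeddings and optimize a self-supervised reconstruction loss on the input image. The toy encoder/decoder and image size are assumptions.</p>

            <pre><code>import torch
import torch.nn as nn

class ToyUMM(nn.Module):
    # A toy stand-in: an "understanding" encoder producing dense embeddings and a
    # "generation" head that must reconstruct the image from those embeddings.
    def __init__(self, d=64):
        super().__init__()
        self.understand = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, d))
        self.generate = nn.Sequential(nn.Linear(d, 3 * 16 * 16), nn.Unflatten(1, (3, 16, 16)))

    def forward(self, img):
        emb = self.understand(img)      # the model's own understanding embedding
        return self.generate(emb)       # reconstruction conditioned on that embedding

torch.manual_seed(0)
model, img = ToyUMM(), torch.rand(4, 3, 16, 16)
loss = nn.functional.mse_loss(model(img), img)   # self-supervised reconstruction loss
loss.backward()
print(float(loss))
</code></pre>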
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction Alignment Improves Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07295v1">http://arxiv.org/abs/2509.07295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Sep 2025 20:36:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7fda8254/6f2f0cbd.mp3" length="23309664" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1453</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction Alignment Improves Unified Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.07295v1">http://arxiv.org/abs/2509.07295v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73$\rightarrow$0.90) and DPGBench (80.93$\rightarrow$88.15), while also boosting editing benchmarks (ImgEdit 3.38$\rightarrow$3.75, GEdit 6.94$\rightarrow$7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward</title>
      <itunes:episode>1125</itunes:episode>
      <podcast:episode>1125</podcast:episode>
      <itunes:title>UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">013d3adf-166f-4f19-af24-3eb56a79f9b3</guid>
      <link>https://share.transistor.fm/s/d248e4fe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06818v1">http://arxiv.org/abs/2509.06818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since humans are especially sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO</p>
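
            <p><strong>Code sketch:</strong><br>
            A toy sketch of what a multi-to-multi matching reward could look like: compute identity similarities between reference and generated faces, solve a global assignment with the Hungarian algorithm, and reward matched similarity while penalizing off-assignment (confused) similarity. The embeddings and weighting here are placeholders, not the paper's implementation.</p>

            <pre><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_reward(ref_embeds, gen_embeds):
    """Toy multi-to-multi matching reward (illustrative, not the paper's code).

    ref_embeds : (R, D) identity embeddings of the reference faces
    gen_embeds : (G, D) identity embeddings of faces found in the generated image
    Returns matched similarity minus an identity-confusion penalty."""
    ref = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    gen = gen_embeds / np.linalg.norm(gen_embeds, axis=1, keepdims=True)
    sim = ref @ gen.T                              # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)       # global assignment maximizing similarity
    matched = sim[rows, cols].mean()
    off_mask = np.ones_like(sim, dtype=bool)
    off_mask[rows, cols] = False                   # similarities not on the assignment
    confusion = sim[off_mask].mean() if off_mask.any() else 0.0
    return matched - confusion

rng = np.random.default_rng(0)
reward = matching_reward(rng.normal(size=(2, 512)), rng.normal(size=(3, 512)))
</code></pre>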
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06818v1">http://arxiv.org/abs/2509.06818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since humans are especially sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 10 Sep 2025 20:35:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d248e4fe/075efb84.mp3" length="22170331" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06818v1">http://arxiv.org/abs/2509.06818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since humans are especially sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With a "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preservation. Code and model: https://github.com/bytedance/UMO</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reverse-Engineered Reasoning for Open-Ended Generation</title>
      <itunes:episode>1124</itunes:episode>
      <podcast:episode>1124</podcast:episode>
      <itunes:title>Reverse-Engineered Reasoning for Open-Ended Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1b38e29e-6612-4e52-b4e3-e74da8041adb</guid>
      <link>https://share.transistor.fm/s/4787e86d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin</p>

            <p><strong>Title:</strong><br>
            Reverse-Engineered Reasoning for Open-Ended Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06160v1">http://arxiv.org/abs/2509.06160v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ``deep reasoning'' paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process ``forwards'' through trial-and-error or imitation, REER works ``backwards'' from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.</p>
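
            <p><strong>Code sketch:</strong><br>
            A hypothetical sketch of the "backwards", gradient-free search idea: start from a rough reasoning draft for a known-good solution and hill-climb over local edits, keeping whichever candidate a pluggable scorer (standing in for, e.g., the solution's perplexity given the trajectory) rates best. The edit proposer and scorer below are toy stand-ins, not the paper's procedure.</p>

            <pre><code>import random

def reverse_engineer_trajectory(draft_steps, propose_edit, score, iters=200, seed=0):
    """Gradient-free local search sketch (our illustration of the backwards idea).

    draft_steps  : list of reasoning steps for a known-good solution
    propose_edit : fn(steps, rng) returning a locally edited candidate list
    score        : fn(steps) returning a float, lower is better (e.g. the
                   perplexity of the known-good solution given these steps)"""
    rng = random.Random(seed)
    best, best_score = list(draft_steps), score(draft_steps)
    for _ in range(iters):
        cand = propose_edit(best, rng)
        cand_score = score(cand)
        if best_score >= cand_score:      # keep candidates that are at least as good
            best, best_score = cand, cand_score
    return best, best_score

def toy_edit(steps, rng):
    i = rng.randrange(len(steps) + 1)
    return steps[:i] + ["add a verification step"] + steps[i:]   # insert one step

def toy_score(steps):
    return abs(len(steps) - 6)            # placeholder for a real perplexity scorer

steps = ["restate the task", "outline the plot", "draft the story", "revise the ending"]
best, best_score = reverse_engineer_trajectory(steps, toy_edit, toy_score)
</code></pre>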
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin</p>

            <p><strong>Title:</strong><br>
            Reverse-Engineered Reasoning for Open-Ended Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06160v1">http://arxiv.org/abs/2509.06160v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ``deep reasoning'' paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process ``forwards'' through trial-and-error or imitation, REER works ``backwards'' from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Sep 2025 20:25:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4787e86d/c22d5466.mp3" length="11959977" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>744</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 107 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin</p>

            <p><strong>Title:</strong><br>
            Reverse-Engineered Reasoning for Open-Ended Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06160v1">http://arxiv.org/abs/2509.06160v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ``deep reasoning'' paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process ``forwards'' through trial-and-error or imitation, REER works ``backwards'' from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Does DINOv3 Set a New Medical Vision Standard?</title>
      <itunes:episode>1123</itunes:episode>
      <podcast:episode>1123</podcast:episode>
      <itunes:title>Does DINOv3 Set a New Medical Vision Standard?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e44f1ccf-27b0-406a-a084-a68dbe756d06</guid>
      <link>https://share.transistor.fm/s/fdd80404</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci</p>

            <p><strong>Title:</strong><br>
            Does DINOv3 Set a New Medical Vision Standard?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06467v1">http://arxiv.org/abs/2509.06467v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, whether the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci</p>

            <p><strong>Title:</strong><br>
            Does DINOv3 Set a New Medical Vision Standard?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06467v1">http://arxiv.org/abs/2509.06467v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, whether the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 09 Sep 2025 20:24:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fdd80404/f2d260d4.mp3" length="10387189" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>646</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci</p>

            <p><strong>Title:</strong><br>
            Does DINOv3 Set a New Medical Vision Standard?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.06467v1">http://arxiv.org/abs/2509.06467v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, whether the efficacy of frontier vision foundation models transfers to specialized domains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling laws in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Symbolic Graphics Programming with Large Language Models</title>
      <itunes:episode>1122</itunes:episode>
      <podcast:episode>1122</podcast:episode>
      <itunes:title>Symbolic Graphics Programming with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5a8cfc5-b3ea-47c6-a895-9ebc593dba73</guid>
      <link>https://share.transistor.fm/s/d035a725</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Symbolic Graphics Programming with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.05208v1">http://arxiv.org/abs/2509.05208v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.</p>
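
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of a format-gated, verifiable reward of this kind: the candidate SVG must actually render (here via cairosvg, one possible rasterizer), otherwise the reward is zero; a pluggable similarity function then stands in for the SigLIP/DINO cross-modal score. All names are our placeholders, not the paper's code.</p>

            <pre><code>import cairosvg   # one possible SVG rasterizer; any renderer would do

def svg_reward(svg_text, prompt, similarity_fn):
    """Format-gated, verifiable reward sketch (our illustration, not the paper's code).

    svg_text      : candidate SVG source produced by the LLM
    prompt        : the natural-language description it should depict
    similarity_fn : fn(png_bytes, prompt) mapping to a score in [0, 1], standing in
                    for a SigLIP-style cross-modal similarity"""
    try:
        png_bytes = cairosvg.svg2png(bytestring=svg_text.encode("utf-8"))
    except Exception:
        return 0.0        # format-validity gate: unrenderable SVG earns zero reward
    return similarity_fn(png_bytes, prompt)

# Toy usage with a stub similarity function.
svg = '&lt;svg xmlns="http://www.w3.org/2000/svg" width="40" height="40"&gt;&lt;circle cx="20" cy="20" r="10"/&gt;&lt;/svg&gt;'
reward = svg_reward(svg, "a single circle", lambda png, text: 0.5)
</code></pre>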
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Symbolic Graphics Programming with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.05208v1">http://arxiv.org/abs/2509.05208v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Sep 2025 20:07:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d035a725/29d185a8.mp3" length="13446242" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>837</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yamei Chen, Haoquan Zhang, Yangyi Huang, Zeju Qiu, Kaipeng Zhang, Yandong Wen, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Symbolic Graphics Programming with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.05208v1">http://arxiv.org/abs/2509.05208v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Set Block Decoding is a Language Model Inference Accelerator</title>
      <itunes:episode>1121</itunes:episode>
      <podcast:episode>1121</podcast:episode>
      <itunes:title>Set Block Decoding is a Language Model Inference Accelerator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8374eeb4-16b5-48f0-8b62-d135eab19b2d</guid>
      <link>https://share.transistor.fm/s/c4d6b88f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman</p>

            <p><strong>Title:</strong><br>
            Set Block Decoding is a Language Model Inference Accelerator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04185v1">http://arxiv.org/abs/2509.04185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.</p>
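
            <p><strong>Code sketch:</strong><br>
            A toy sketch of the general set-block idea (not SBD itself): a block of future positions is filled over a few passes, and each pass commits the most confident predictions, which need not be consecutive. The stub predictor below stands in for a real model.</p>

            <pre><code>import numpy as np

def set_block_decode(prompt_tokens, propose, block_size=8, per_pass=3, seed=0):
    """Toy set-block decoding loop (our sketch of the general idea, not SBD itself).

    propose : fn(tokens, masked_positions, rng) returning {position: (token, confidence)},
              standing in for a model that predicts every masked position in parallel.
    Each pass commits the most confident predictions, which need not be consecutive,
    so the block is filled in a handful of passes instead of block_size passes."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens) + [None] * block_size     # None marks a masked slot
    passes = 0
    while any(t is None for t in tokens):
        masked = [i for i, t in enumerate(tokens) if t is None]
        preds = propose(tokens, masked, rng)
        # commit the per_pass most confident positions this round
        for pos in sorted(masked, key=lambda i: -preds[i][1])[:per_pass]:
            tokens[pos] = preds[pos][0]
        passes += 1
    return tokens, passes

# Stub predictor: random token ids with random confidences.
stub = lambda toks, masked, rng: {i: (int(rng.integers(1000)), float(rng.random())) for i in masked}
out, n_passes = set_block_decode([101, 102], stub)   # fills 8 slots in 3 passes
</code></pre>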
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman</p>

            <p><strong>Title:</strong><br>
            Set Block Decoding is a Language Model Inference Accelerator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04185v1">http://arxiv.org/abs/2509.04185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 08 Sep 2025 20:06:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c4d6b88f/c30d1a13.mp3" length="15754217" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>981</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman</p>

            <p><strong>Title:</strong><br>
            Set Block Decoding is a Language Model Inference Accelerator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04185v1">http://arxiv.org/abs/2509.04185v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth</title>
      <itunes:episode>1120</itunes:episode>
      <podcast:episode>1120</podcast:episode>
      <itunes:title>Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">285e588e-1334-4b90-bcaa-685d97b7ac24</guid>
      <link>https://share.transistor.fm/s/7b606888</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.03867v1">http://arxiv.org/abs/2509.03867v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.03867v1">http://arxiv.org/abs/2509.03867v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Sep 2025 20:25:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7b606888/f6583bc8.mp3" length="22098009" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 100 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.03867v1">http://arxiv.org/abs/2509.03867v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth", utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a small but diverse benchmark dataset of over 1,200 meticulously curated examples, with select instances in English, Mandarin, Spanish, French, Japanese, and Korean. Annotation was especially challenging: each of the examples required careful expert review to verify that it truly reflected Drivelological characteristics. The process involved multiple rounds of discussion and adjudication to address disagreements, highlighting the subtle and subjective nature of Drivelology. We evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss the implied rhetorical function altogether. These findings highlight a deeper representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Editor to Dense Geometry Estimator</title>
      <itunes:episode>1119</itunes:episode>
      <podcast:episode>1119</podcast:episode>
      <itunes:title>From Editor to Dense Geometry Estimator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">447b7391-11ef-4cac-8f11-a5ba81638dc4</guid>
      <link>https://share.transistor.fm/s/aabcf7da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao</p>

            <p><strong>Title:</strong><br>
            From Editor to Dense Geometry Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04338v1">http://arxiv.org/abs/2509.04338v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning.   Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts.   Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other.   Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.</p>
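
            <p><strong>Code sketch:</strong><br>
            A small sketch of logarithmic quantization in the depth setting, assuming the goal is simply to spend the available discrete levels uniformly in log-depth so that near-range values keep fine resolution; the depth range and level count below are arbitrary placeholders, not values from the paper.</p>

            <pre><code>import numpy as np

def log_quantize(depth, d_min=0.1, d_max=100.0, levels=65536):
    """Logarithmic depth quantization sketch (our illustration of the general idea).

    Bins are spaced uniformly in log-depth, so near-range depths keep fine
    resolution even when values are stored at limited precision."""
    depth = np.clip(depth, d_min, d_max)
    t = (np.log(depth) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))  # map to [0, 1]
    return np.round(t * (levels - 1)).astype(np.int64)

def log_dequantize(codes, d_min=0.1, d_max=100.0, levels=65536):
    t = codes.astype(np.float64) / (levels - 1)
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))

depth = np.array([0.5, 2.0, 50.0])
restored = log_dequantize(log_quantize(depth))   # close to the original depths
</code></pre>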
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao</p>

            <p><strong>Title:</strong><br>
            From Editor to Dense Geometry Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04338v1">http://arxiv.org/abs/2509.04338v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning.   Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts.   Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other.   Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Sep 2025 20:24:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aabcf7da/0504d2f8.mp3" length="18058405" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1125</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao</p>

            <p><strong>Title:</strong><br>
            From Editor to Dense Geometry Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04338v1">http://arxiv.org/abs/2509.04338v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning.   Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts.   Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other.   Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards a Unified View of Large Language Model Post-Training</title>
      <itunes:episode>1118</itunes:episode>
      <podcast:episode>1118</podcast:episode>
      <itunes:title>Towards a Unified View of Large Language Model Post-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8e8e0bf8-b8ec-4de4-8d06-1daf9e3c7a56</guid>
      <link>https://share.transistor.fm/s/14d53ad9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Towards a Unified View of Large Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04419v1">http://arxiv.org/abs/2509.04419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Towards a Unified View of Large Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04419v1">http://arxiv.org/abs/2509.04419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Sep 2025 20:24:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/14d53ad9/3f54b9c0.mp3" length="22250138" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Towards a Unified View of Large Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04419v1">http://arxiv.org/abs/2509.04419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks</title>
      <itunes:episode>1117</itunes:episode>
      <podcast:episode>1117</podcast:episode>
      <itunes:title>DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25f6887a-cdc4-4733-8022-2a709e38edf7</guid>
      <link>https://share.transistor.fm/s/afc117ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou</p>

            <p><strong>Title:</strong><br>
            DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01396v1">http://arxiv.org/abs/2509.01396v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou</p>

            <p><strong>Title:</strong><br>
            DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01396v1">http://arxiv.org/abs/2509.01396v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Sep 2025 20:23:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/afc117ff/0765a2e2.mp3" length="19437302" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1211</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, Dongzhan Zhou</p>

            <p><strong>Title:</strong><br>
            DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01396v1">http://arxiv.org/abs/2509.01396v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. To automatically construct DeepResearch Arena, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts. The MAHTG system further translates research-worthy inspirations into high-quality research tasks, ensuring the traceability of research task formulation while filtering noise. With the MAHTG system, we curate DeepResearch Arena with over 10,000 high-quality research tasks from over 200 academic seminars, spanning 12 disciplines, such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps observed across different models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?</title>
      <itunes:episode>1116</itunes:episode>
      <podcast:episode>1116</podcast:episode>
      <itunes:title>Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2fd5304d-5e30-49a4-a97a-9c88b382357b</guid>
      <link>https://share.transistor.fm/s/0ddb18e5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04292v1">http://arxiv.org/abs/2509.04292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04292v1">http://arxiv.org/abs/2509.04292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 05 Sep 2025 20:23:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0ddb18e5/0f656b20.mp3" length="22194163" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1383</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.04292v1">http://arxiv.org/abs/2509.04292v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Open Data Synthesis For Deep Research</title>
      <itunes:episode>1115</itunes:episode>
      <podcast:episode>1115</podcast:episode>
      <itunes:title>Open Data Synthesis For Deep Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e6e37298-8fd0-4e14-a712-5c31f00709da</guid>
      <link>https://share.transistor.fm/s/9aa30380</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            Open Data Synthesis For Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00375v1">http://arxiv.org/abs/2509.00375v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets in <a href="https://github.com/VectorSpaceLab/InfoSeek">this repository</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            Open Data Synthesis For Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00375v1">http://arxiv.org/abs/2509.00375v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets in <a href="https://github.com/VectorSpaceLab/InfoSeek">this repository</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Sep 2025 19:57:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9aa30380/21fa923a.mp3" length="22181570" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1383</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu</p>

            <p><strong>Title:</strong><br>
            Open Data Synthesis For Deep Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00375v1">http://arxiv.org/abs/2509.00375v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our code and datasets in <a href="https://github.com/VectorSpaceLab/InfoSeek">this repository</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Robix: A Unified Model for Robot Interaction, Reasoning and Planning</title>
      <itunes:episode>1114</itunes:episode>
      <podcast:episode>1114</podcast:episode>
      <itunes:title>Robix: A Unified Model for Robot Interaction, Reasoning and Planning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7ce2e02f-b4a8-49f8-8da7-ee97bee733c6</guid>
      <link>https://share.transistor.fm/s/ca6ca049</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robix: A Unified Model for Robot Interaction, Reasoning and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01106v1">http://arxiv.org/abs/2509.01106v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robix: A Unified Model for Robot Interaction, Reasoning and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01106v1">http://arxiv.org/abs/2509.01106v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 04 Sep 2025 19:57:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca6ca049/7edd81c4.mp3" length="21138793" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robix: A Unified Model for Robot Interaction, Reasoning and Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01106v1">http://arxiv.org/abs/2509.01106v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning</title>
      <itunes:episode>1113</itunes:episode>
      <podcast:episode>1113</podcast:episode>
      <itunes:title>UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6276475b-74e7-4ec6-b6b1-2630c8855253</guid>
      <link>https://share.transistor.fm/s/db124b4d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02544v1">http://arxiv.org/abs/2509.02544v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and its strong generalization to real-world interactive scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02544v1">http://arxiv.org/abs/2509.02544v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and its strong generalization to real-world interactive scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:21:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/db124b4d/d6a3ce34.mp3" length="23348561" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1456</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Bo Li, Chen Dun, Chong Liu, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02544v1">http://arxiv.org/abs/2509.02544v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and its strong generalization to real-world interactive scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model</title>
      <itunes:episode>1112</itunes:episode>
      <podcast:episode>1112</podcast:episode>
      <itunes:title>LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0f9506b6-4cf4-45b0-973c-edc591419990</guid>
      <link>https://share.transistor.fm/s/0d4e8c60</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang</p>

            <p><strong>Title:</strong><br>
            LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00676v1">http://arxiv.org/abs/2509.00676v1</a></p>

            <p><strong>Abstract:</strong><br>
            In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang</p>

            <p><strong>Title:</strong><br>
            LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00676v1">http://arxiv.org/abs/2509.00676v1</a></p>

            <p><strong>Abstract:</strong><br>
            In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:20:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0d4e8c60/01bdd7a6.mp3" length="22905506" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang</p>

            <p><strong>Title:</strong><br>
            LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.00676v1">http://arxiv.org/abs/2509.00676v1</a></p>

            <p><strong>Abstract:</strong><br>
            In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding</title>
      <itunes:episode>1111</itunes:episode>
      <podcast:episode>1111</podcast:episode>
      <itunes:title>ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">33eb5016-0bab-4c39-b120-8bb8658f5eb6</guid>
      <link>https://share.transistor.fm/s/345384de</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21496v2">http://arxiv.org/abs/2508.21496v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination -- producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21496v2">http://arxiv.org/abs/2508.21496v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination -- producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:20:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/345384de/bb17bbca.mp3" length="21683414" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu</p>

            <p><strong>Title:</strong><br>
            ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21496v2">http://arxiv.org/abs/2508.21496v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination -- producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that the positional encoding strategy contributes to alleviating SAH, and further adopt a DPO strategy to enhance the model's ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion</title>
      <itunes:episode>1110</itunes:episode>
      <podcast:episode>1110</podcast:episode>
      <itunes:title>POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">53e44a86-7757-465e-be85-9c04a2f63702</guid>
      <link>https://share.transistor.fm/s/313ccf73</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01215v1">http://arxiv.org/abs/2509.01215v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01215v1">http://arxiv.org/abs/2509.01215v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:19:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/313ccf73/f0bd6156.mp3" length="19513792" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1216</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01215v1">http://arxiv.org/abs/2509.01215v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Baichuan-M2: Scaling Medical Capability with Large Verifier System</title>
      <itunes:episode>1109</itunes:episode>
      <podcast:episode>1109</podcast:episode>
      <itunes:title>Baichuan-M2: Scaling Medical Capability with Large Verifier System</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1b972aa0-9822-47e5-a74a-c47724c849f5</guid>
      <link>https://share.transistor.fm/s/bd73cf66</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang</p>

            <p><strong>Title:</strong><br>
            Baichuan-M2: Scaling Medical Capability with Large Verifier System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02208v1">http://arxiv.org/abs/2509.02208v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark -- previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang</p>

            <p><strong>Title:</strong><br>
            Baichuan-M2: Scaling Medical Capability with Large Verifier System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02208v1">http://arxiv.org/abs/2509.02208v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark -- previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:19:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd73cf66/cf07cd2a.mp3" length="22682314" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang</p>

            <p><strong>Title:</strong><br>
            Baichuan-M2: Scaling Medical Capability with Large Verifier System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.02208v1">http://arxiv.org/abs/2509.02208v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark -- previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kwai Keye-VL 1.5 Technical Report</title>
      <itunes:episode>1108</itunes:episode>
      <podcast:episode>1108</podcast:episode>
      <itunes:title>Kwai Keye-VL 1.5 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7265aa86-4131-40fe-a1cd-57340e1cff25</guid>
      <link>https://share.transistor.fm/s/88caa2a6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL 1.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01563v1">http://arxiv.org/abs/2509.01563v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL 1.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01563v1">http://arxiv.org/abs/2509.01563v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:18:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/88caa2a6/27d4faa4.mp3" length="17636678" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1099</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL 1.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01563v1">http://arxiv.org/abs/2509.01563v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic</title>
      <itunes:episode>1107</itunes:episode>
      <podcast:episode>1107</podcast:episode>
      <itunes:title>Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ccbf3f8-7c91-455c-bbf7-e305c436bfb6</guid>
      <link>https://share.transistor.fm/s/d3c91a79</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01363v1">http://arxiv.org/abs/2509.01363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01363v1">http://arxiv.org/abs/2509.01363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 03 Sep 2025 21:18:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3c91a79/f4d4bd92.mp3" length="23390352" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1458</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem</p>

            <p><strong>Title:</strong><br>
            Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2509.01363v1">http://arxiv.org/abs/2509.01363v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning</title>
      <itunes:episode>1106</itunes:episode>
      <podcast:episode>1106</podcast:episode>
      <itunes:title>PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a32ba578-f9fe-489d-9247-00499ab5d50d</guid>
      <link>https://share.transistor.fm/s/ae4480e8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang</p>

            <p><strong>Title:</strong><br>
            PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21104v1">http://arxiv.org/abs/2508.21104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang</p>

            <p><strong>Title:</strong><br>
            PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21104v1">http://arxiv.org/abs/2508.21104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 02 Sep 2025 19:45:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ae4480e8/9edf5b00.mp3" length="21160950" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1319</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang</p>

            <p><strong>Title:</strong><br>
            PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21104v1">http://arxiv.org/abs/2508.21104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to roll out in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning</title>
      <itunes:episode>1105</itunes:episode>
      <podcast:episode>1105</podcast:episode>
      <itunes:title>R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">71d56edc-a2ee-48ff-a3ca-6070b52c4f8a</guid>
      <link>https://share.transistor.fm/s/c756be05</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng</p>

            <p><strong>Title:</strong><br>
            R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21113v1">http://arxiv.org/abs/2508.21113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng</p>

            <p><strong>Title:</strong><br>
            R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21113v1">http://arxiv.org/abs/2508.21113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Sep 2025 20:25:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c756be05/776ab7c0.mp3" length="19220406" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1198</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng</p>

            <p><strong>Title:</strong><br>
            R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21113v1">http://arxiv.org/abs/2508.21113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers</title>
      <itunes:episode>1104</itunes:episode>
      <podcast:episode>1104</podcast:episode>
      <itunes:title>A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">279d35e2-12e2-4f6c-8b0b-3234cae92bbe</guid>
      <link>https://share.transistor.fm/s/8b0829ef</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21148v1">http://arxiv.org/abs/2508.21148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21148v1">http://arxiv.org/abs/2508.21148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 01 Sep 2025 20:25:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8b0829ef/1a80301d.mp3" length="22365939" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1394</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He, Chunfeng Song, Lei Bai, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.21148v1">http://arxiv.org/abs/2508.21148v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands -- heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling</title>
      <itunes:episode>1103</itunes:episode>
      <podcast:episode>1103</podcast:episode>
      <itunes:title>TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">023cdf16-7f5b-4269-af8d-650fc5f826ce</guid>
      <link>https://share.transistor.fm/s/73415e4d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17445v1">http://arxiv.org/abs/2508.17445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, with the sampling design saving 22% to 43% of GPU hours for the trained models, while showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project page is at https://m-a-p.ai/TreePO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17445v1">http://arxiv.org/abs/2508.17445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, with the sampling design saving 22% to 43% of GPU hours for the trained models, while showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project page is at https://m-a-p.ai/TreePO.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:41:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73415e4d/08c33fd4.mp3" length="20890577" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17445v1">http://arxiv.org/abs/2508.17445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, with the sampling design saving 22% to 43% of GPU hours for the trained models, while showing up to a 40% reduction in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project page is at https://m-a-p.ai/TreePO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VibeVoice Technical Report</title>
      <itunes:episode>1102</itunes:episode>
      <podcast:episode>1102</podcast:episode>
      <itunes:title>VibeVoice Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa1f0a22-e4c1-4ee1-9760-24868cc78941</guid>
      <link>https://share.transistor.fm/s/8e2fda53</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            VibeVoice Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19205v1">http://arxiv.org/abs/2508.19205v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            VibeVoice Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19205v1">http://arxiv.org/abs/2508.19205v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:41:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e2fda53/b6cb81d8.mp3" length="20516827" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL, cs.AI, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            VibeVoice Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19205v1">http://arxiv.org/abs/2508.19205v1</a></p>

            <p><strong>Abstract:</strong><br>
            This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics</title>
      <itunes:episode>1101</itunes:episode>
      <podcast:episode>1101</podcast:episode>
      <itunes:title>CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5923a87f-5e88-4f3b-8c77-247d267899d9</guid>
      <link>https://share.transistor.fm/s/3f25479b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng</p>

            <p><strong>Title:</strong><br>
            CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18124v2">http://arxiv.org/abs/2508.18124v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng</p>

            <p><strong>Title:</strong><br>
            CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18124v2">http://arxiv.org/abs/2508.18124v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:40:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3f25479b/ad3de283.mp3" length="19301465" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1203</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He, Jin Zeng, Zeke Xie, Mao Su, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang, Yunqi Cai, Xi Dai, Shufei Zhang, Lei Bai, Jinguang Cheng, Zhong Fang, Hongming Weng</p>

            <p><strong>Title:</strong><br>
            CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18124v2">http://arxiv.org/abs/2508.18124v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical, frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space</title>
      <itunes:episode>1100</itunes:episode>
      <podcast:episode>1100</podcast:episode>
      <itunes:title>VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">40c54392-a959-4f93-af1e-d503d4759b2b</guid>
      <link>https://share.transistor.fm/s/a58b39a8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19247v1">http://arxiv.org/abs/2508.19247v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19247v1">http://arxiv.org/abs/2508.19247v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:40:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a58b39a8/29cb5f23.mp3" length="20056285" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1250</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19247v1">http://arxiv.org/abs/2508.19247v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation</title>
      <itunes:episode>1099</itunes:episode>
      <podcast:episode>1099</podcast:episode>
      <itunes:title>OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">72ed007f-3aa2-4877-b3f2-555186e969d9</guid>
      <link>https://share.transistor.fm/s/9c74ee72</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19209v1">http://arxiv.org/abs/2508.19209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness, and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19209v1">http://arxiv.org/abs/2508.19209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness, and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:40:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9c74ee72/d11b8b80.mp3" length="21783712" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianwen Jiang, Weihong Zeng, Zerong Zheng, Jiaqi Yang, Chao Liang, Wang Liao, Han Liang, Yuan Zhang, Mingyuan Gao</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.19209v1">http://arxiv.org/abs/2508.19209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness, and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Spacer: Towards Engineered Scientific Inspiration</title>
      <itunes:episode>1098</itunes:episode>
      <podcast:episode>1098</podcast:episode>
      <itunes:title>Spacer: Towards Engineered Scientific Inspiration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4d1d4395-b645-4bbf-b167-c0254ead2637</guid>
      <link>https://share.transistor.fm/s/6fae25d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.LG, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung</p>

            <p><strong>Title:</strong><br>
            Spacer: Towards Engineered Scientific Inspiration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17661v1">http://arxiv.org/abs/2508.17661v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or to the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via 'deliberate decontextualization,' an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.LG, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung</p>

            <p><strong>Title:</strong><br>
            Spacer: Towards Engineered Scientific Inspiration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17661v1">http://arxiv.org/abs/2508.17661v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or to the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via 'deliberate decontextualization,' an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:39:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6fae25d8/424ae383.mp3" length="21617337" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.LG, cs.NE</p>

            <p><strong>Authors:</strong><br>
            Minhyeong Lee, Suyoung Hwang, Seunghyun Moon, Geonho Nah, Donghyun Koh, Youngjun Cho, Johyun Park, Hojin Yoo, Jiho Park, Haneul Choi, Sungbin Moon, Taehoon Hwang, Seungwon Kim, Jaeyeong Kim, Seongjun Kim, Juneau Jung</p>

            <p><strong>Title:</strong><br>
            Spacer: Towards Engineered Scientific Inspiration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.17661v1">http://arxiv.org/abs/2508.17661v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in LLMs have made automated scientific research the next frontline in the path to artificial superintelligence. However, these systems are bound either to tasks of narrow scope or to the limited creative capabilities of LLMs. We propose Spacer, a scientific discovery system that develops creative and factually grounded concepts without external intervention. Spacer attempts to achieve this via 'deliberate decontextualization,' an approach that disassembles information into atomic units - keywords - and draws creativity from unexplored connections between them. Spacer consists of (i) Nuri, an inspiration engine that builds keyword sets, and (ii) the Manifesting Pipeline that refines these sets into elaborate scientific statements. Nuri extracts novel, high-potential keyword sets from a keyword graph built with 180,000 academic publications in biological fields. The Manifesting Pipeline finds links between keywords, analyzes their logical structure, validates their plausibility, and ultimately drafts original scientific concepts. According to our experiments, the evaluation metric of Nuri accurately classifies high-impact publications with an AUROC score of 0.737. Our Manifesting Pipeline also successfully reconstructs core concepts from the latest top-journal articles solely from their keyword sets. An LLM-based scoring system estimates that this reconstruction was sound for over 85% of the cases. Finally, our embedding space analysis shows that outputs from Spacer are significantly more similar to leading publications compared with those from SOTA LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning</title>
      <itunes:episode>1097</itunes:episode>
      <podcast:episode>1097</podcast:episode>
      <itunes:title>UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89eea594-03c6-42f9-b4c4-4f24d0b82e87</guid>
      <link>https://share.transistor.fm/s/44258f66</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18756v1">http://arxiv.org/abs/2508.18756v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18756v1">http://arxiv.org/abs/2508.18756v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Aug 2025 20:39:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/44258f66/2d3a6745.mp3" length="18927393" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1179</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao</p>

            <p><strong>Title:</strong><br>
            UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18756v1">http://arxiv.org/abs/2508.18756v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with far fewer memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budget, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency</title>
      <itunes:episode>1096</itunes:episode>
      <podcast:episode>1096</podcast:episode>
      <itunes:title>InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03cb7a88-5cf9-4d44-ad80-5de000a91d07</guid>
      <link>https://share.transistor.fm/s/139925c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo</p>

            <p><strong>Title:</strong><br>
            InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18265v1">http://arxiv.org/abs/2508.18265v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo</p>

            <p><strong>Title:</strong><br>
            InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18265v1">http://arxiv.org/abs/2508.18265v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.</p>
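
            <p><strong>Illustrative sketch:</strong><br>
            A toy sketch of the routing idea behind a resolution router: a scorer looks at an image's visual tokens and either keeps them at full resolution or merges neighbouring token pairs to cut the token count. The scorer, the 0.5 threshold, and pairwise merging are illustrative assumptions and not the ViR design details.</p>

            <pre><code>import torch
import torch.nn as nn

class ResolutionRouterSketch(nn.Module):
    """Toy router: keep full visual-token resolution for hard images,
    merge token pairs for easy ones (assumes an even token count)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)

    def forward(self, tokens):                         # (batch, n_tokens, d_model)
        keep_full = torch.sigmoid(self.scorer(tokens.mean(dim=1))).squeeze(-1)
        routed = []
        for i in range(tokens.size(0)):
            t = tokens[i]
            if keep_full[i].item() > 0.5:              # hard image: full resolution
                routed.append(t)
            else:                                      # easy image: halve the tokens
                routed.append(t.view(-1, 2, t.size(-1)).mean(dim=1))
        return routed                                  # per-image token sequences
</code></pre>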
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Aug 2025 20:05:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/139925c6/97644159.mp3" length="22365111" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1394</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 120 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Songze Li, Xiangyu Zhao, Haodong Duan, Nianchen Deng, Bin Fu, Yinan He, Yi Wang, Conghui He, Botian Shi, Junjun He, Yingtong Xiong, Han Lv, Lijun Wu, Wenqi Shao, Kaipeng Zhang, Huipeng Deng, Biqing Qi, Jiaye Ge, Qipeng Guo, Wenwei Zhang, Wanli Ouyang, Limin Wang, Min Dou, Xizhou Zhu, Tong Lu, Dahua Lin, Jifeng Dai, Bowen Zhou, Weijie Su, Kai Chen, Yu Qiao, Wenhai Wang, Gen Luo</p>

            <p><strong>Title:</strong><br>
            InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18265v1">http://arxiv.org/abs/2508.18265v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks -- narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation</title>
      <itunes:episode>1095</itunes:episode>
      <podcast:episode>1095</podcast:episode>
      <itunes:title>Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ccc0ca1-0092-4329-b28f-24cb389efd24</guid>
      <link>https://share.transistor.fm/s/bdad6873</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18032v2">http://arxiv.org/abs/2508.18032v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18032v2">http://arxiv.org/abs/2508.18032v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.</p>
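
            <p><strong>Illustrative sketch:</strong><br>
            A tiny illustration of the stage-aware idea: instead of a single final-only reward, per-stage signals (semantic reasoning, process refining, outcome evaluation) are blended into the training signal. The weights below are placeholders, not values from the paper.</p>

            <pre><code>def combined_stage_reward(r_semantic, r_process, r_outcome,
                          weights=(0.3, 0.3, 0.4)):
    """Blend three stage-level rewards into one scalar (illustrative weights)."""
    w_sem, w_proc, w_out = weights
    return w_sem * r_semantic + w_proc * r_process + w_out * r_outcome

# Final-only guidance ignores the intermediate stages; stage-aware guidance does not.
final_only = combined_stage_reward(0.8, 0.2, 1.0, weights=(0.0, 0.0, 1.0))   # 1.0
stage_aware = combined_stage_reward(0.8, 0.2, 1.0)                           # 0.7
</code></pre>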
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Aug 2025 20:04:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bdad6873/fcdab63d.mp3" length="18285833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1139</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yaqi Li, Peng Chen, Mingyang Han, Pi Bu, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.18032v2">http://arxiv.org/abs/2508.18032v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MV-RAG: Retrieval Augmented Multiview Diffusion</title>
      <itunes:episode>1094</itunes:episode>
      <podcast:episode>1094</podcast:episode>
      <itunes:title>MV-RAG: Retrieval Augmented Multiview Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fecf93ff-c948-4f2c-a499-4c04c232975a</guid>
      <link>https://share.transistor.fm/s/17d02cc6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yosef Dayani, Omer Benishu, Sagie Benaim</p>

            <p><strong>Title:</strong><br>
            MV-RAG: Retrieval Augmented Multiview Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16577v1">http://arxiv.org/abs/2508.16577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yosef Dayani, Omer Benishu, Sagie Benaim</p>

            <p><strong>Title:</strong><br>
            MV-RAG: Retrieval Augmented Multiview Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16577v1">http://arxiv.org/abs/2508.16577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.</p>
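
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of a held-out view prediction objective as described above: one retrieved view is hidden, the model predicts it from the remaining views, and the loss is computed against the hidden view. The model interface and the MSE loss are assumptions for illustration, not the paper's training code.</p>

            <pre><code>import random
import torch.nn.functional as F

def held_out_view_loss(views, model):
    """views: list of image tensors of one object; model predicts a missing view."""
    idx = random.randrange(len(views))
    target = views[idx]
    context = [v for i, v in enumerate(views) if i != idx]
    pred = model(context)            # assumed interface: context views in, one view out
    return F.mse_loss(pred, target)
</code></pre>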
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Aug 2025 20:04:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/17d02cc6/df66ff2c.mp3" length="19765358" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yosef Dayani, Omer Benishu, Sagie Benaim</p>

            <p><strong>Title:</strong><br>
            MV-RAG: Retrieval Augmented Multiview Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16577v1">http://arxiv.org/abs/2508.16577v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Memento: Fine-tuning LLM Agents without Fine-tuning LLMs</title>
      <itunes:episode>1093</itunes:episode>
      <podcast:episode>1093</podcast:episode>
      <itunes:title>Memento: Fine-tuning LLM Agents without Fine-tuning LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cb3f5f6a-69eb-4f6f-ae0f-22d3552c8452</guid>
      <link>https://share.transistor.fm/s/119d78d3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento: Fine-tuning LLM Agents without Fine-tuning LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16153v2">http://arxiv.org/abs/2508.16153v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely <em>Memento</em>, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento: Fine-tuning LLM Agents without Fine-tuning LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16153v2">http://arxiv.org/abs/2508.16153v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely <em>Memento</em>, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.</p>
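
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of a non-parametric episodic case memory in the spirit described: cases are written as (embedding, action, reward), read back by cosine similarity, and the retrieved cases are ordered by reward so the agent can prefer actions that worked before. Names and the similarity choice are assumptions, not the released code.</p>

            <pre><code>import numpy as np

class EpisodicMemorySketch:
    """Toy case memory: write (embedding, action, reward), read by similarity."""
    def __init__(self):
        self.embeddings, self.cases = [], []

    def write(self, embedding, action, reward):
        self.embeddings.append(np.asarray(embedding, dtype=np.float32))
        self.cases.append({"action": action, "reward": reward})

    def read(self, query, k=4):
        if not self.cases:
            return []
        emb = np.stack(self.embeddings)
        q = np.asarray(query, dtype=np.float32)
        sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[:k]
        # order retrieved cases by reward so high-reward precedents come first
        return sorted((self.cases[i] for i in top), key=lambda c: -c["reward"])
</code></pre>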
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Aug 2025 20:08:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/119d78d3/fda4e84d.mp3" length="21711803" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Memento: Fine-tuning LLM Agents without Fine-tuning LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.16153v2">http://arxiv.org/abs/2508.16153v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely <em>Memento</em>, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR</title>
      <itunes:episode>1092</itunes:episode>
      <podcast:episode>1092</podcast:episode>
      <itunes:title>Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">de597ec1-648e-43c3-9a4e-9921863b03c1</guid>
      <link>https://share.transistor.fm/s/a7d88fc5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14029v2">http://arxiv.org/abs/2508.14029v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14029v2">http://arxiv.org/abs/2508.14029v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.</p>
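
            <p><strong>Illustrative sketch:</strong><br>
            A schematic, runnable skeleton of one self-play round as described: when the policy solves a problem, it synthesizes a variant question whose reference answer is kept identical, the variant is checked, and verified variants are added to the training pool. The policy, verify, and synthesize_variant callables are placeholders for illustration, not the paper's implementation.</p>

            <pre><code>def svs_round(problems, policy, verify, synthesize_variant):
    """One illustrative SvS round over a list of {"question", "answer"} dicts."""
    new_problems = []
    for prob in problems:
        solution = policy(prob["question"])
        if not verify(solution, prob["answer"]):
            continue                                   # only correct solutions seed variants
        variant_q = synthesize_variant(prob["question"], solution)
        variant = {"question": variant_q, "answer": prob["answer"]}  # same reference answer
        if verify(policy(variant_q), prob["answer"]):  # sanity-check the variant (assumption)
            new_problems.append(variant)
    return problems + new_problems
</code></pre>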
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Aug 2025 20:07:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7d88fc5/f804dfe8.mp3" length="20816551" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1297</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14029v2">http://arxiv.org/abs/2508.14029v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks</title>
      <itunes:episode>1091</itunes:episode>
      <podcast:episode>1091</podcast:episode>
      <itunes:title>ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8eefc96d-9190-4a27-88a3-1a3acecd2da1</guid>
      <link>https://share.transistor.fm/s/ddf24dea</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08240v1">http://arxiv.org/abs/2508.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied.   In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system's generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08240v1">http://arxiv.org/abs/2508.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied.   In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system's generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Aug 2025 20:07:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ddf24dea/f79a6435.mp3" length="20645615" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaijun Wang, Liqin Lu, Mingyu Liu, Jianuo Jiang, Zeju Li, Bolin Zhang, Wancai Zheng, Xinyi Yu, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08240v1">http://arxiv.org/abs/2508.08240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language-guided long-horizon mobile manipulation has long been a grand challenge in embodied semantic reasoning, generalizable manipulation, and adaptive locomotion. Three fundamental limitations hinder progress: First, although large language models have improved spatial reasoning and task planning through semantic priors, existing implementations remain confined to tabletop scenarios, failing to address the constrained perception and limited actuation ranges of mobile platforms. Second, current manipulation strategies exhibit insufficient generalization when confronted with the diverse object configurations encountered in open-world environments. Third, while crucial for practical deployment, the dual requirement of maintaining high platform maneuverability alongside precise end-effector control in unstructured settings remains understudied.   In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. To address the challenge of egocentric perception in language-conditioned tasks, we introduce a hierarchical planner powered by a vision-language model, enabling long-horizon instruction decomposition and precise action execution. At the control level, our novel whole-body policy achieves robust coordination across challenging terrains. We further present the first benchmark for long-horizon mobile manipulation, evaluating diverse indoor and outdoor scenarios. Through successful sim-to-real transfer, we demonstrate the system's generalization and robustness in real-world deployments, underscoring the practicality of legged manipulators in unstructured environments. Our work advances the feasibility of generalized robotic assistants capable of complex, dynamic tasks. Our project page: https://kaijwang.github.io/odyssey.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Intern-S1: A Scientific Multimodal Foundation Model</title>
      <itunes:episode>1090</itunes:episode>
      <podcast:episode>1090</podcast:episode>
      <itunes:title>Intern-S1: A Scientific Multimodal Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27bf7425-2f96-470b-af3b-25919958fe45</guid>
      <link>https://share.transistor.fm/s/bc718bf3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 166 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou</p>

            <p><strong>Title:</strong><br>
            Intern-S1: A Scientific Multimodal Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15763v1">http://arxiv.org/abs/2508.15763v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely followed fields, with performance approaching that of closed-source models. However, in high-value but more challenging professional scientific fields, these areas either still rely on expert models, or general foundation models lag significantly behind their progress in popular areas, remaining far from sufficient for transforming scientific research and leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and take a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to coordinate RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 166 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou</p>

            <p><strong>Title:</strong><br>
            Intern-S1: A Scientific Multimodal Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15763v1">http://arxiv.org/abs/2508.15763v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely followed fields, with performance approaching that of closed-source models. However, in high-value but more challenging professional scientific fields, these areas either still rely on expert models, or general foundation models lag significantly behind their progress in popular areas, remaining far from sufficient for transforming scientific research and leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and take a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to coordinate RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Aug 2025 20:16:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bc718bf3/0138c96e.mp3" length="18714195" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1166</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 166 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou</p>

            <p><strong>Title:</strong><br>
            Intern-S1: A Scientific Multimodal Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15763v1">http://arxiv.org/abs/2508.15763v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mobile-Agent-v3: Foundamental Agents for GUI Automation</title>
      <itunes:episode>1089</itunes:episode>
      <podcast:episode>1089</podcast:episode>
      <itunes:title>Mobile-Agent-v3: Foundamental Agents for GUI Automation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3209231c-bede-42a0-9c61-ca154c0ac9f7</guid>
      <link>https://share.transistor.fm/s/5cef12cf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-v3: Foundamental Agents for GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15144v1">http://arxiv.org/abs/2508.15144v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-v3: Foundamental Agents for GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15144v1">http://arxiv.org/abs/2508.15144v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Aug 2025 20:16:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5cef12cf/3f350956.mp3" length="24089154" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1502</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-v3: Foundamental Agents for GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15144v1">http://arxiv.org/abs/2508.15144v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Deep Think with Confidence</title>
      <itunes:episode>1088</itunes:episode>
      <podcast:episode>1088</podcast:episode>
      <itunes:title>Deep Think with Confidence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e9858d9c-8706-4452-ba0a-d6e78276279c</guid>
      <link>https://share.transistor.fm/s/15f33003</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao</p>

            <p><strong>Title:</strong><br>
            Deep Think with Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15260v1">http://arxiv.org/abs/2508.15260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao</p>

            <p><strong>Title:</strong><br>
            Deep Think with Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15260v1">http://arxiv.org/abs/2508.15260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Aug 2025 20:16:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/15f33003/b558f19c.mp3" length="19901592" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1240</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Xuewei Wang, Yuandong Tian, Jiawei Zhao</p>

            <p><strong>Title:</strong><br>
            Deep Think with Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15260v1">http://arxiv.org/abs/2508.15260v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries</title>
      <itunes:episode>1087</itunes:episode>
      <podcast:episode>1087</podcast:episode>
      <itunes:title>LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5127273b-fae0-4af2-9dc8-f1052e07d30c</guid>
      <link>https://share.transistor.fm/s/a9716dd3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15760v1">http://arxiv.org/abs/2508.15760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15760v1">http://arxiv.org/abs/2508.15760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Aug 2025 20:15:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9716dd3/4c9af6f5.mp3" length="22908030" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1428</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.15760v1">http://arxiv.org/abs/2508.15760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization</title>
      <itunes:episode>1086</itunes:episode>
      <podcast:episode>1086</podcast:episode>
      <itunes:title>DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7d127ae-8a0d-47a6-bb52-79dbf57562bf</guid>
      <link>https://share.transistor.fm/s/c230f99c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang</p>

            <p><strong>Title:</strong><br>
            DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14460v1">http://arxiv.org/abs/2508.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang</p>

            <p><strong>Title:</strong><br>
            DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14460v1">http://arxiv.org/abs/2508.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Aug 2025 20:32:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c230f99c/a4e28a67.mp3" length="22118499" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuaijie She, Yu Bao, Yu Lu, Lu Xu, Tao Li, Wenhao Zhu, Shujian Huang, Shanbo Cheng, Lu Lu, Yuxuan Wang</p>

            <p><strong>Title:</strong><br>
            DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14460v1">http://arxiv.org/abs/2508.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models</title>
      <itunes:episode>1085</itunes:episode>
      <podcast:episode>1085</podcast:episode>
      <itunes:title>From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e6f3d1ae-3592-45bb-978e-b7e4b5dd706c</guid>
      <link>https://share.transistor.fm/s/a920d8b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13491v1">http://arxiv.org/abs/2508.13491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13491v1">http://arxiv.org/abs/2508.13491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Aug 2025 20:31:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a920d8b9/4a60b5a7.mp3" length="22380582" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CE</p>

            <p><strong>Authors:</strong><br>
            Ziyan Kuang, Feiyu Zhu, Maowei Jiang, Yanzhao Lai, Zelin Wang, Zhitong Wang, Meikang Qiu, Jiajia Huang, Min Peng, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13491v1">http://arxiv.org/abs/2508.13491v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction</title>
      <itunes:episode>1084</itunes:episode>
      <podcast:episode>1084</podcast:episode>
      <itunes:title>FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9ca72a67-a8ab-4577-80a1-e2e9e2d43ea2</guid>
      <link>https://share.transistor.fm/s/eaf82792</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11987v2">http://arxiv.org/abs/2508.11987v2</a></p>

            <p><strong>Abstract:</strong><br>
            Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce <strong>FutureX</strong>, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11987v2">http://arxiv.org/abs/2508.11987v2</a></p>

            <p><strong>Abstract:</strong><br>
            Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce <strong>FutureX</strong>, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Aug 2025 20:31:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eaf82792/9d840c22.mp3" length="21197728" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, Mingren Yin, Zhenwei Zhu, Tianle Cai, Zehui Chen, Jiecao Chen, Yantao Du, Xiang Gao, Jiacheng Guo, Liang Hu, Jianpeng Jiao, Xiangsheng Li, Jingkai Liu, Shuang Ni, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xin Zhou, Jose Blanchet, Xipeng Qiu, Mengdi Wang, Wenhao Huang</p>

            <p><strong>Title:</strong><br>
            FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11987v2">http://arxiv.org/abs/2508.11987v2</a></p>

            <p><strong>Abstract:</strong><br>
            Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce <strong>FutureX</strong>, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds</title>
      <itunes:episode>1083</itunes:episode>
      <podcast:episode>1083</podcast:episode>
      <itunes:title>MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2e05468d-dc98-4cb1-a406-9e1c70aedd10</guid>
      <link>https://share.transistor.fm/s/004b16da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, Zhaoyang Lyu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14879v1">http://arxiv.org/abs/2508.14879v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, Zhaoyang Lyu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14879v1">http://arxiv.org/abs/2508.14879v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Aug 2025 20:30:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/004b16da/dfc823ba.mp3" length="21278813" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bingquan Dai, Li Ray Luo, Qihong Tang, Jie Wang, Xinyu Lian, Hao Xu, Minghan Qin, Xudong Xu, Bo Dai, Haoqian Wang, Zhaoyang Lyu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14879v1">http://arxiv.org/abs/2508.14879v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 3D objects into editable programs is pivotal for applications like reverse engineering and shape editing. However, existing methods often rely on limited domain-specific languages (DSLs) and small-scale datasets, restricting their ability to model complex geometries and structures. To address these challenges, we introduce MeshCoder, a novel framework that reconstructs complex 3D objects from point clouds into editable Blender Python scripts. We develop a comprehensive set of expressive Blender Python APIs capable of synthesizing intricate geometries. Leveraging these APIs, we construct a large-scale paired object-code dataset, where the code for each object is decomposed into distinct semantic parts. Subsequently, we train a multimodal large language model (LLM) that translates 3D point clouds into executable Blender Python scripts. Our approach not only achieves superior performance in shape-to-code reconstruction tasks but also facilitates intuitive geometric and topological editing through convenient code modifications. Furthermore, our code-based representation enhances the reasoning capabilities of LLMs in 3D shape understanding tasks. Together, these contributions establish MeshCoder as a powerful and flexible solution for programmatic 3D shape reconstruction and understanding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization</title>
      <itunes:episode>1082</itunes:episode>
      <podcast:episode>1082</podcast:episode>
      <itunes:title>Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">437a7984-50a0-42a6-8aee-2e8039d37dcf</guid>
      <link>https://share.transistor.fm/s/c27371e9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14811v1">http://arxiv.org/abs/2508.14811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14811v1">http://arxiv.org/abs/2508.14811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Aug 2025 20:30:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c27371e9/9de8c9c9.mp3" length="20100625" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1253</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Canyu Zhao, Xiaoman Li, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14811v1">http://arxiv.org/abs/2508.14811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL</title>
      <itunes:episode>1081</itunes:episode>
      <podcast:episode>1081</podcast:episode>
      <itunes:title>Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dcbeb96e-edbc-404c-a747-ed96efec42cb</guid>
      <link>https://share.transistor.fm/s/e0b3271f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13167v1">http://arxiv.org/abs/2508.13167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We fully open-source the entire research effort, including the model weights, the code for training and evaluation, and the training data, offering a solid starting point for future research on agent models and agentic RL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13167v1">http://arxiv.org/abs/2508.13167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We fully open-source the entire research effort, including the model weights, the code for training and evaluation, and the training data, offering a solid starting point for future research on agent models and agentic RL.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Aug 2025 20:10:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e0b3271f/3b9ce8a2.mp3" length="21308511" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, He Zhu, Dingfeng Shi, Piaohong Wang, Yeyi Guan, Xiangru Tang, Minghao Liu, Yuchen Eleanor Jiang, Jian Yang, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13167v1">http://arxiv.org/abs/2508.13167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models (LLMs) and multi-agent systems have demonstrated remarkable capabilities in complex problem-solving tasks such as deep research, vibe coding, and mathematical reasoning. However, most existing multi-agent systems are built upon manual prompt/workflow engineering with sophisticated agent frameworks, making them computationally inefficient, less capable, and unable to benefit from data-centric learning. In this work, we introduce Chain-of-Agents (CoA), a novel paradigm of LLM reasoning that enables native end-to-end complex problem-solving in the same way as a multi-agent system (i.e., multi-turn problem solving with multiple tools and multiple agents) within one model. In chain-of-agents problem-solving, the model dynamically activates different tool agents and role-playing agents to simulate multi-agent collaboration in an end-to-end fashion. To elicit end-to-end chain-of-agents problem-solving abilities in LLMs, we introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning. We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving. We call the resulting models Agent Foundation Models (AFMs). Our empirical studies demonstrate that AFM establishes new state-of-the-art performance across diverse benchmarks in both web agent and code agent settings. We fully open-source the entire research effort, including the model weights, the code for training and evaluation, and the training data, offering a solid starting point for future research on agent models and agentic RL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos</title>
      <itunes:episode>1080</itunes:episode>
      <podcast:episode>1080</podcast:episode>
      <itunes:title>LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dd60e54c-c3ad-495d-ba77-b900d4c38785</guid>
      <link>https://share.transistor.fm/s/2cddaf90</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14041v1">http://arxiv.org/abs/2508.14041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14041v1">http://arxiv.org/abs/2508.14041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Aug 2025 20:10:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2cddaf90/4fef903e.mp3" length="20543203" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1280</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chin-Yang Lin, Cheng Sun, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.14041v1">http://arxiv.org/abs/2508.14041v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. Extensive experiments on challenging benchmarks demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches. Project page: https://linjohnss.github.io/longsplat/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Prompt Orchestration Markup Language</title>
      <itunes:episode>1079</itunes:episode>
      <podcast:episode>1079</podcast:episode>
      <itunes:title>Prompt Orchestration Markup Language</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">568ad294-2eb0-47a9-ab97-03ab7bc74556</guid>
      <link>https://share.transistor.fm/s/bb9f09b6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.HC, cs.AI, cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Yuge Zhang, Nan Chen, Jiahang Xu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Prompt Orchestration Markup Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13948v1">http://arxiv.org/abs/2508.13948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.HC, cs.AI, cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Yuge Zhang, Nan Chen, Jiahang Xu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Prompt Orchestration Markup Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13948v1">http://arxiv.org/abs/2508.13948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Aug 2025 20:10:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bb9f09b6/0c126b27.mp3" length="22307793" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.HC, cs.AI, cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Yuge Zhang, Nan Chen, Jiahang Xu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Prompt Orchestration Markup Language</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13948v1">http://arxiv.org/abs/2508.13948v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ovis2.5 Technical Report</title>
      <itunes:episode>1078</itunes:episode>
      <podcast:episode>1078</podcast:episode>
      <itunes:title>Ovis2.5 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">66c80632-2a63-4f7d-b78c-69eb13a379ff</guid>
      <link>https://share.transistor.fm/s/4bdf2bd5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Ovis2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11737v1">http://arxiv.org/abs/2508.11737v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Ovis2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11737v1">http://arxiv.org/abs/2508.11737v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:58:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4bdf2bd5/6018874c.mp3" length="22289390" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1389</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Ovis2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11737v1">http://arxiv.org/abs/2508.11737v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning</title>
      <itunes:episode>1077</itunes:episode>
      <podcast:episode>1077</podcast:episode>
      <itunes:title>ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25fc5fd3-cb11-4f4d-9371-d77002d6ac97</guid>
      <link>https://share.transistor.fm/s/64c543e0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu</p>

            <p><strong>Title:</strong><br>
            ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10419v1">http://arxiv.org/abs/2508.10419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Narrative comprehension of long stories and novels has been a challenging domain, owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long-context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu</p>

            <p><strong>Title:</strong><br>
            ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10419v1">http://arxiv.org/abs/2508.10419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Narrative comprehension of long stories and novels has been a challenging domain, owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long-context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:58:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/64c543e0/d59d4e8a.mp3" length="20829523" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, Liyan Xu</p>

            <p><strong>Title:</strong><br>
            ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10419v1">http://arxiv.org/abs/2508.10419v1</a></p>

            <p><strong>Abstract:</strong><br>
            Narrative comprehension of long stories and novels has been a challenging domain, owing to their intricate plotlines and entangled, often evolving relations among characters and entities. Given LLMs' diminished reasoning over extended context and high computational cost, retrieval-based approaches continue to play a pivotal role in practice. However, traditional RAG methods can fall short due to their stateless, single-step retrieval process, which often overlooks the dynamic nature of capturing interconnected relations within long-range context. In this work, we propose ComoRAG, holding the principle that narrative reasoning is not a one-shot process, but a dynamic, evolving interplay between new evidence acquisition and past knowledge consolidation, analogous to human cognition when reasoning with memory-related signals in the brain. Specifically, when encountering a reasoning impasse, ComoRAG undergoes iterative reasoning cycles while interacting with a dynamic memory workspace. In each cycle, it generates probing queries to devise new exploratory paths, then integrates the retrieved evidence of new aspects into a global memory pool, thereby supporting the emergence of a coherent context for query resolution. Across four challenging long-context narrative benchmarks (200K+ tokens), ComoRAG outperforms strong RAG baselines with consistent relative gains of up to 11% over the strongest baseline. Further analysis reveals that ComoRAG is particularly advantageous for complex queries requiring global comprehension, offering a principled, cognitively motivated paradigm for retrieval-based long-context comprehension towards stateful reasoning. Our code is publicly released at https://github.com/EternityJune25/ComoRAG</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>4DNeX: Feed-Forward 4D Generative Modeling Made Easy</title>
      <itunes:episode>1076</itunes:episode>
      <podcast:episode>1076</podcast:episode>
      <itunes:title>4DNeX: Feed-Forward 4D Generative Modeling Made Easy</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c351ec35-6876-49c2-82b2-8d29feb7e8da</guid>
      <link>https://share.transistor.fm/s/b9813f7b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            4DNeX: Feed-Forward 4D Generative Modeling Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13154v1">http://arxiv.org/abs/2508.13154v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            4DNeX: Feed-Forward 4D Generative Modeling Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13154v1">http://arxiv.org/abs/2508.13154v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:58:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9813f7b/1a6facfe.mp3" length="22727858" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            4DNeX: Feed-Forward 4D Generative Modeling Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13154v1">http://arxiv.org/abs/2508.13154v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Next Visual Granularity Generation</title>
      <itunes:episode>1075</itunes:episode>
      <podcast:episode>1075</podcast:episode>
      <itunes:title>Next Visual Granularity Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56e2ab17-df2e-4a26-8e41-bc30420c111a</guid>
      <link>https://share.transistor.fm/s/c7bc592d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Next Visual Granularity Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12811v1">http://arxiv.org/abs/2508.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -&gt; 3.03, 2.57 -&gt; 2.44, 2.09 -&gt; 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Next Visual Granularity Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12811v1">http://arxiv.org/abs/2508.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -&gt; 3.03, 2.57 -&gt; 2.44, 2.09 -&gt; 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:57:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c7bc592d/f9578ff6.mp3" length="21894011" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            Next Visual Granularity Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12811v1">http://arxiv.org/abs/2508.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 -&gt; 3.03, 2.57 -&gt; 2.44, 2.09 -&gt; 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Speed Always Wins: A Survey on Efficient Architectures for Large Language Models</title>
      <itunes:episode>1074</itunes:episode>
      <podcast:episode>1074</podcast:episode>
      <itunes:title>Speed Always Wins: A Survey on Efficient Architectures for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b11331c5-7564-4a5f-86c5-baaf1bef9a95</guid>
      <link>https://share.transistor.fm/s/0532eed4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Speed Always Wins: A Survey on Efficient Architectures for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09834v1">http://arxiv.org/abs/2508.09834v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Speed Always Wins: A Survey on Efficient Architectures for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09834v1">http://arxiv.org/abs/2508.09834v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:57:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0532eed4/35637a6d.mp3" length="20133195" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Speed Always Wins: A Survey on Efficient Architectures for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09834v1">http://arxiv.org/abs/2508.09834v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs</title>
      <itunes:episode>1073</itunes:episode>
      <podcast:episode>1073</podcast:episode>
      <itunes:title>When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">102e4e84-88d4-4589-bf99-8fa932088299</guid>
      <link>https://share.transistor.fm/s/fff89580</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov</p>

            <p><strong>Title:</strong><br>
            When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11383v1">http://arxiv.org/abs/2508.11383v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov</p>

            <p><strong>Title:</strong><br>
            When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11383v1">http://arxiv.org/abs/2508.11383v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:57:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fff89580/974e77c1.mp3" length="21714343" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Seleznyov, Mikhail Chaichuk, Gleb Ershov, Alexander Panchenko, Elena Tutubalina, Oleg Somov</p>

            <p><strong>Title:</strong><br>
            When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11383v1">http://arxiv.org/abs/2508.11383v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Has GPT-5 Achieved Spatial Intelligence? An Empirical Study</title>
      <itunes:episode>1072</itunes:episode>
      <podcast:episode>1072</podcast:episode>
      <itunes:title>Has GPT-5 Achieved Spatial Intelligence? An Empirical Study</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9dbc0213-77fb-473c-988f-ed39a93eca6f</guid>
      <link>https://share.transistor.fm/s/3ffba44b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.CL, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Has GPT-5 Achieved Spatial Intelligence? An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13142v1">http://arxiv.org/abs/2508.13142v1</a></p>

            <p><strong>Abstract:</strong><br>
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities for achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) find that proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.CL, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Has GPT-5 Achieved Spatial Intelligence? An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13142v1">http://arxiv.org/abs/2508.13142v1</a></p>

            <p><strong>Abstract:</strong><br>
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities for achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) find that proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:56:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3ffba44b/8d069d23.mp3" length="19211574" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1197</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.CL, cs.LG, cs.MM, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang</p>

            <p><strong>Title:</strong><br>
            Has GPT-5 Achieved Spatial Intelligence? An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.13142v1">http://arxiv.org/abs/2508.13142v1</a></p>

            <p><strong>Abstract:</strong><br>
Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities for achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) find that proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds</title>
      <itunes:episode>1071</itunes:episode>
      <podcast:episode>1071</podcast:episode>
      <itunes:title>HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d54c5f3f-c2e0-4a68-a4d0-357b646e8fc3</guid>
      <link>https://share.transistor.fm/s/8901d6d3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette</p>

            <p><strong>Title:</strong><br>
            HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12782v1">http://arxiv.org/abs/2508.12782v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette</p>

            <p><strong>Title:</strong><br>
            HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12782v1">http://arxiv.org/abs/2508.12782v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Aug 2025 20:56:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8901d6d3/6cc42b25.mp3" length="22445774" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette</p>

            <p><strong>Title:</strong><br>
            HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.12782v1">http://arxiv.org/abs/2508.12782v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SSRL: Self-Search Reinforcement Learning</title>
      <itunes:episode>1070</itunes:episode>
      <podcast:episode>1070</podcast:episode>
      <itunes:title>SSRL: Self-Search Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f245f88d-192a-46f3-9c99-36ad6acb624c</guid>
      <link>https://share.transistor.fm/s/980f2584</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            SSRL: Self-Search Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10874v1">http://arxiv.org/abs/2508.10874v1</a></p>

            <p><strong>Abstract:</strong><br>
            We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            SSRL: Self-Search Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10874v1">http://arxiv.org/abs/2508.10874v1</a></p>

            <p><strong>Abstract:</strong><br>
            We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Aug 2025 20:30:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/980f2584/0f11ad75.mp3" length="21035947" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1311</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, Li Kang, Gang Chen, Cheng Huang, Zhizhou He, Bingning Wang, Lei Bai, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            SSRL: Self-Search Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10874v1">http://arxiv.org/abs/2508.10874v1</a></p>

            <p><strong>Abstract:</strong><br>
            We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that SSRL-trained policy models provide a cost-effective and stable environment for search-driven RL training, reducing reliance on external search engines and facilitating robust sim-to-real transfer. We draw the following conclusions: 1) LLMs possess world knowledge that can be effectively elicited to achieve high performance; 2) SSRL demonstrates the potential of leveraging internal knowledge to reduce hallucination; 3) SSRL-trained models integrate seamlessly with external search engines without additional effort. Our findings highlight the potential of LLMs to support more scalable RL agent training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DINOv3</title>
      <itunes:episode>1069</itunes:episode>
      <podcast:episode>1069</podcast:episode>
      <itunes:title>DINOv3</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b8ac5b9a-e3a9-46c9-a4c6-7c57fb394c7d</guid>
      <link>https://share.transistor.fm/s/827d4456</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski</p>

            <p><strong>Title:</strong><br>
            DINOv3</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10104v1">http://arxiv.org/abs/2508.10104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski</p>

            <p><strong>Title:</strong><br>
            DINOv3</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10104v1">http://arxiv.org/abs/2508.10104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Aug 2025 20:29:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/827d4456/6423f84b.mp3" length="25551962" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1593</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski</p>

            <p><strong>Title:</strong><br>
            DINOv3</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10104v1">http://arxiv.org/abs/2508.10104v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thyme: Think Beyond Images</title>
      <itunes:episode>1068</itunes:episode>
      <podcast:episode>1068</podcast:episode>
      <itunes:title>Thyme: Think Beyond Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6bda53d7-d307-4f0a-b64b-564466a117f0</guid>
      <link>https://share.transistor.fm/s/01e61eed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Thyme: Think Beyond Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11630v1">http://arxiv.org/abs/2508.11630v1</a></p>

            <p><strong>Abstract:</strong><br>
Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing ``think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Thyme: Think Beyond Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11630v1">http://arxiv.org/abs/2508.11630v1</a></p>

            <p><strong>Abstract:</strong><br>
Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing ``think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Aug 2025 20:29:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/01e61eed/c919d504.mp3" length="22524285" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Thyme: Think Beyond Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.11630v1">http://arxiv.org/abs/2508.11630v1</a></p>

            <p><strong>Abstract:</strong><br>
Following OpenAI's introduction of the ``thinking with images'' concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing ``think with images'' approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining</title>
      <itunes:episode>1067</itunes:episode>
      <podcast:episode>1067</podcast:episode>
      <itunes:title>BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed112043-8a16-41f4-80eb-ffdd57953d77</guid>
      <link>https://share.transistor.fm/s/b26f0631</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt</p>

            <p><strong>Title:</strong><br>
            BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10975v1">http://arxiv.org/abs/2508.10975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt</p>

            <p><strong>Title:</strong><br>
            BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10975v1">http://arxiv.org/abs/2508.10975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Aug 2025 20:28:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b26f0631/be91d362.mp3" length="23241554" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1449</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, Siddharth Joshi, Spandan Das, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt</p>

            <p><strong>Title:</strong><br>
            BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10975v1">http://arxiv.org/abs/2508.10975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization</title>
      <itunes:episode>1066</itunes:episode>
      <podcast:episode>1066</podcast:episode>
      <itunes:title>XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e1fcc01e-9c53-4740-947d-8555543192c7</guid>
      <link>https://share.transistor.fm/s/cd5479cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami</p>

            <p><strong>Title:</strong><br>
            XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10395v1">http://arxiv.org/abs/2508.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2$\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with $&lt;0.1$ perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10$\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5$\times$ memory savings with only $0.1$ perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami</p>

            <p><strong>Title:</strong><br>
            XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10395v1">http://arxiv.org/abs/2508.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2$\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with $&lt;0.1$ perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10$\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5$\times$ memory savings with only $0.1$ perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.</p>
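
            <p><strong>Code sketch:</strong><br>
            A minimal, illustrative sketch of the core idea described above: cache a low-bit copy of the layer input activations X instead of K and V, then rematerialize K and V on the fly. The per-tensor int8 quantization, the tiny dimensions, and the weight names are assumptions for illustration, not the paper's implementation.</p>

            <pre><code>import torch

d_model, n_tokens = 64, 8
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
X = torch.randn(n_tokens, d_model)        # layer input activations

# Standard KV caching stores both K and V.
K_ref, V_ref = X @ W_k, X @ W_v

# XQuant-style alternative: cache only a quantized X (half the cached tensors),
# then recompute K and V when they are needed at decode time.
scale = X.abs().max() / 127.0
X_q = torch.clamp((X / scale).round(), -127, 127).to(torch.int8)   # what gets cached
X_hat = X_q.float() * scale                                        # dequantize
K, V = X_hat @ W_k, X_hat @ W_v                                    # rematerialized on the fly

print("max |K - K_ref|:", (K - K_ref).abs().max().item())
</code></pre>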
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Aug 2025 20:28:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cd5479cb/754852ce.mp3" length="22253086" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami</p>

            <p><strong>Title:</strong><br>
            XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10395v1">http://arxiv.org/abs/2508.10395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although LLM inference has emerged as a critical workload for many downstream applications, efficiently inferring LLMs is challenging due to the substantial memory footprint and bandwidth requirements. In parallel, compute capabilities have steadily outpaced both memory capacity and bandwidth over the last few decades, a trend that remains evident in modern GPU hardware and exacerbates the challenge of LLM inference. As such, new algorithms are emerging that trade increased computation for reduced memory operations. To that end, we present XQuant, which takes advantage of this trend, enabling an order-of-magnitude reduction in memory consumption through low-bit quantization with substantial accuracy benefits relative to state-of-the-art KV cache quantization methods. We accomplish this by quantizing and caching the layer input activations X, instead of using standard KV caching, and then rematerializing the Keys and Values on-the-fly during inference. This results in an immediate 2$\times$ memory savings compared to KV caching. By applying XQuant, we achieve up to $\sim 7.7\times$ memory savings with $&lt;0.1$ perplexity degradation compared to the FP16 baseline. Furthermore, our approach leverages the fact that X values are similar across layers. Building on this observation, we introduce XQuant-CL, which exploits the cross-layer similarity in the X embeddings for extreme compression. Across different models, XQuant-CL attains up to 10$\times$ memory savings relative to the FP16 baseline with only 0.01 perplexity degradation, and 12.5$\times$ memory savings with only $0.1$ perplexity degradation. XQuant exploits the rapidly increasing compute capabilities of hardware platforms to eliminate the memory bottleneck, while surpassing state-of-the-art KV cache quantization methods and achieving near-FP16 accuracy across a wide range of models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning</title>
      <itunes:episode>1065</itunes:episode>
      <podcast:episode>1065</podcast:episode>
      <itunes:title>We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b106772a-37d7-4de2-8d62-e30ccf510bc5</guid>
      <link>https://share.transistor.fm/s/f6ade795</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10433v1">http://arxiv.org/abs/2508.10433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard &amp; Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10433v1">http://arxiv.org/abs/2508.10433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard &amp; Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Aug 2025 20:16:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f6ade795/709b5ca8.mp3" length="20336749" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1267</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, Jie Wang, Chong Sun, Chen Li, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10433v1">http://arxiv.org/abs/2508.10433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various tasks, but still struggle with complex mathematical reasoning. Existing research primarily focuses on dataset construction and method optimization, often overlooking two critical aspects: comprehensive knowledge-driven design and model-centric data space modeling. In this paper, we introduce We-Math 2.0, a unified system that integrates a structured mathematical knowledge system, model-centric data space modeling, and a reinforcement learning (RL)-based training paradigm to comprehensively enhance the mathematical reasoning abilities of MLLMs. The key contributions of We-Math 2.0 are fourfold: (1) MathBook Knowledge System: We construct a five-level hierarchical system encompassing 491 knowledge points and 1,819 fundamental principles. (2) MathBook-Standard &amp; Pro: We develop MathBook-Standard, a dataset that ensures broad conceptual coverage and flexibility through dual expansion. Additionally, we define a three-dimensional difficulty space and generate 7 progressive variants per problem to build MathBook-Pro, a challenging dataset for robust training. (3) MathBook-RL: We propose a two-stage RL framework comprising: (i) Cold-Start Fine-tuning, which aligns the model with knowledge-oriented chain-of-thought reasoning; and (ii) Progressive Alignment RL, leveraging average-reward learning and dynamic data scheduling to achieve progressive alignment across difficulty levels. (4) MathBookEval: We introduce a comprehensive benchmark covering all 491 knowledge points with diverse reasoning step distributions. Experimental results show that MathBook-RL performs competitively with existing baselines on four widely-used benchmarks and achieves strong results on MathBookEval, suggesting promising generalization in mathematical reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale</title>
      <itunes:episode>1064</itunes:episode>
      <podcast:episode>1064</podcast:episode>
      <itunes:title>NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e22c1e99-dc52-4315-94ac-eae623843b60</guid>
      <link>https://share.transistor.fm/s/8ca80a2f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 101 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10711v1">http://arxiv.org/abs/2508.10711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 101 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10711v1">http://arxiv.org/abs/2508.10711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.</p>
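
            <p><strong>Code sketch:</strong><br>
            To make the "flow matching head" concrete, here is a minimal sketch of a standard rectified-flow-style loss for predicting a continuous image token from the transformer's hidden state. The MLP head, the dimensions, and the conditioning scheme are assumptions for illustration; the abstract does not specify NextStep-1's actual head design.</p>

            <pre><code>import torch
import torch.nn as nn

d_hidden, d_tok, batch = 256, 16, 32
head = nn.Sequential(nn.Linear(d_hidden + d_tok + 1, 512), nn.SiLU(), nn.Linear(512, d_tok))

h  = torch.randn(batch, d_hidden)   # hidden state at the position of the next image token
x1 = torch.randn(batch, d_tok)      # ground-truth continuous image token
x0 = torch.randn_like(x1)           # noise sample
t  = torch.rand(batch, 1)           # random time in [0, 1]
xt = (1 - t) * x0 + t * x1          # point on the straight path from noise to data

v_pred = head(torch.cat([h, xt, t], dim=-1))
loss = ((v_pred - (x1 - x0)) ** 2).mean()   # regress the velocity (x1 - x0)
loss.backward()
</code></pre>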
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Aug 2025 20:16:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8ca80a2f/65fb1f64.mp3" length="22242219" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1386</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 101 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, Hongyu Zhou, Kenkun Liu, Ailin Huang, Bin Wang, Changxin Miao, Deshan Sun, En Yu, Fukun Yin, Gang Yu, Hao Nie, Haoran Lv, Hanpeng Hu, Jia Wang, Jian Zhou, Jianjian Sun, Kaijun Tan, Kang An, Kangheng Lin, Liang Zhao, Mei Chen, Peng Xing, Rui Wang, Shiyu Liu, Shutao Xia, Tianhao You, Wei Ji, Xianfang Zeng, Xin Han, Xuelin Zhang, Yana Wei, Yanming Xu, Yimin Jiang, Yingming Wang, Yu Zhou, Yucheng Han, Ziyang Meng, Binxing Jiao, Daxin Jiang, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10711v1">http://arxiv.org/abs/2508.10711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts</title>
      <itunes:episode>1063</itunes:episode>
      <podcast:episode>1063</podcast:episode>
      <itunes:title>PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">09b8802d-8ab2-41aa-9aca-ce860efc5802</guid>
      <link>https://share.transistor.fm/s/a7500b01</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09848v2">http://arxiv.org/abs/2508.09848v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by &gt;15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09848v2">http://arxiv.org/abs/2508.09848v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by &gt;15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Aug 2025 20:15:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7500b01/6b90fd12.mp3" length="24202042" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1509</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09848v2">http://arxiv.org/abs/2508.09848v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by &gt;15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing</title>
      <itunes:episode>1062</itunes:episode>
      <podcast:episode>1062</podcast:episode>
      <itunes:title>ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5dd1abd5-d761-4436-8618-35e025c9c987</guid>
      <link>https://share.transistor.fm/s/18c2c640</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10881v1">http://arxiv.org/abs/2508.10881v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10881v1">http://arxiv.org/abs/2508.10881v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Aug 2025 20:15:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18c2c640/853059af.mp3" length="20911850" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1303</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lingen Li, Guangzhi Wang, Zhaoyang Zhang, Yaowei Li, Xiaoyu Li, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.10881v1">http://arxiv.org/abs/2508.10881v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional cartoon and anime production involves keyframing, inbetweening, and colorization stages, which require intensive manual effort. Despite recent advances in AI, existing methods often handle these stages separately, leading to error accumulation and artifacts. For instance, inbetweening approaches struggle with large motions, while colorization methods require dense per-frame sketches. To address this, we introduce ToonComposer, a generative model that unifies inbetweening and colorization into a single post-keyframing stage. ToonComposer employs a sparse sketch injection mechanism to provide precise control using keyframe sketches. Additionally, it uses a cartoon adaptation method with the spatial low-rank adapter to tailor a modern video foundation model to the cartoon domain while keeping its temporal prior intact. Requiring as few as a single sketch and a colored reference frame, ToonComposer excels with sparse inputs, while also supporting multiple sketches at any temporal location for more precise motion control. This dual capability reduces manual workload and improves flexibility, empowering artists in real-world scenarios. To evaluate our model, we further created PKBench, a benchmark featuring human-drawn sketches that simulate real-world use cases. Our evaluation demonstrates that ToonComposer outperforms existing methods in visual quality, motion consistency, and production efficiency, offering a superior and more flexible solution for AI-assisted cartoon production.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Story2Board: A Training-Free Approach for Expressive Storyboard Generation</title>
      <itunes:episode>1061</itunes:episode>
      <podcast:episode>1061</podcast:episode>
      <itunes:title>Story2Board: A Training-Free Approach for Expressive Storyboard Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ec275198-8a3f-4a2f-8b7e-f3b7c53c77a9</guid>
      <link>https://share.transistor.fm/s/2e5cd566</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski</p>

            <p><strong>Title:</strong><br>
            Story2Board: A Training-Free Approach for Expressive Storyboard Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09983v1">http://arxiv.org/abs/2508.09983v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski</p>

            <p><strong>Title:</strong><br>
            Story2Board: A Training-Free Approach for Expressive Storyboard Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09983v1">http://arxiv.org/abs/2508.09983v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.</p>
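
            <p><strong>Code sketch:</strong><br>
            One plausible reading of "Reciprocal Attention Value Mixing": softly blend value features between token pairs whose attention is mutual. The reciprocity score (product of the two attention directions), the normalization, and the mixing strength alpha are assumptions for illustration, not the authors' formulation.</p>

            <pre><code>import torch

n_a, n_b, d = 6, 6, 16                        # tokens per panel, feature dim
A = torch.softmax(torch.randn(n_a, n_b), dim=-1)   # attention: panel A tokens -> panel B tokens
B = torch.softmax(torch.randn(n_b, n_a), dim=-1)   # attention: panel B tokens -> panel A tokens
V_a, V_b = torch.randn(n_a, d), torch.randn(n_b, d)

R = A * B.T                                   # high where both directions attend strongly
R = R / (R.sum(dim=-1, keepdim=True) + 1e-8)  # row-normalize the reciprocity scores

alpha = 0.3                                   # mixing strength
V_a_mixed = (1 - alpha) * V_a + alpha * (R @ V_b)   # pull in features from reciprocal partners
</code></pre>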
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:56:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e5cd566/03cb4a83.mp3" length="20676536" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Dinkevich, Matan Levy, Omri Avrahami, Dvir Samuel, Dani Lischinski</p>

            <p><strong>Title:</strong><br>
            Story2Board: A Training-Free Approach for Expressive Storyboard Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09983v1">http://arxiv.org/abs/2508.09983v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Story2Board, a training-free framework for expressive storyboard generation from natural language. Existing methods narrowly focus on subject identity, overlooking key aspects of visual storytelling such as spatial composition, background evolution, and narrative pacing. To address this, we introduce a lightweight consistency framework composed of two components: Latent Panel Anchoring, which preserves a shared character reference across panels, and Reciprocal Attention Value Mixing, which softly blends visual features between token pairs with strong reciprocal attention. Together, these mechanisms enhance coherence without architectural changes or fine-tuning, enabling state-of-the-art diffusion models to generate visually diverse yet consistent storyboards. To structure generation, we use an off-the-shelf language model to convert free-form stories into grounded panel-level prompts. To evaluate, we propose the Rich Storyboard Benchmark, a suite of open-domain narratives designed to assess layout diversity and background-grounded storytelling, in addition to consistency. We also introduce a new Scene Diversity metric that quantifies spatial and pose variation across storyboards. Our qualitative and quantitative results, as well as a user study, show that Story2Board produces more dynamic, coherent, and narratively engaging storyboards than existing baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery</title>
      <itunes:episode>1060</itunes:episode>
      <podcast:episode>1060</podcast:episode>
      <itunes:title>Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">50e45f0a-d3e3-4163-a8f8-1145cb7fc9c3</guid>
      <link>https://share.transistor.fm/s/d24dc648</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li</p>

            <p><strong>Title:</strong><br>
            Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08401v1">http://arxiv.org/abs/2508.08401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li</p>

            <p><strong>Title:</strong><br>
            Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08401v1">http://arxiv.org/abs/2508.08401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:56:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d24dc648/50fe2614.mp3" length="22239694" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1386</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiatong Li, Weida Wang, Qinggang Zhang, Junxian Li, Di Zhang, Changmeng Zheng, Shufei Zhang, Xiaoyong Wei, Qing Li</p>

            <p><strong>Title:</strong><br>
            Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08401v1">http://arxiv.org/abs/2508.08401v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</title>
      <itunes:episode>1059</itunes:episode>
      <podcast:episode>1059</podcast:episode>
      <itunes:title>Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d03c5ae-b10d-4460-afc8-29d45418ea7a</guid>
      <link>https://share.transistor.fm/s/13483bb8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li</p>

            <p><strong>Title:</strong><br>
            Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07901v2">http://arxiv.org/abs/2508.07901v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just $\sim$1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li</p>

            <p><strong>Title:</strong><br>
            Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07901v2">http://arxiv.org/abs/2508.07901v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just $\sim$1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:55:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13483bb8/03120cbe.mp3" length="20276972" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li</p>

            <p><strong>Title:</strong><br>
            Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07901v2">http://arxiv.org/abs/2508.07901v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just $\sim$1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing</title>
      <itunes:episode>1058</itunes:episode>
      <podcast:episode>1058</podcast:episode>
      <itunes:title>Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">345668e3-fb1e-4d96-a5fb-4680058072a2</guid>
      <link>https://share.transistor.fm/s/7d157fc2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09192v1">http://arxiv.org/abs/2508.09192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ faster inference than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09192v1">http://arxiv.org/abs/2508.09192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ faster inference than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:55:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d157fc2/8c87e8ab.mp3" length="21940029" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09192v1">http://arxiv.org/abs/2508.09192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than $\mathbf{2.5\times}$ faster inference than LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than $\mathbf{50\times}$ while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory</title>
      <itunes:episode>1057</itunes:episode>
      <podcast:episode>1057</podcast:episode>
      <itunes:title>Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">507278cb-9932-407a-bf0e-52cf2efc817c</guid>
      <link>https://share.transistor.fm/s/ca89792a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li</p>

            <p><strong>Title:</strong><br>
            Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09736v1">http://arxiv.org/abs/2508.09736v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li</p>

            <p><strong>Title:</strong><br>
            Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09736v1">http://arxiv.org/abs/2508.09736v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:54:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca89792a/b04ea750.mp3" length="21097434" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1315</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li</p>

            <p><strong>Title:</strong><br>
            Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09736v1">http://arxiv.org/abs/2508.09736v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving</title>
      <itunes:episode>1056</itunes:episode>
      <podcast:episode>1056</podcast:episode>
      <itunes:title>AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d5d36ed4-8414-456b-8b3e-c0b00cbb0ca6</guid>
      <link>https://share.transistor.fm/s/0acd8d0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09889v1">http://arxiv.org/abs/2508.09889v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent systems (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09889v1">http://arxiv.org/abs/2508.09889v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent systems (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Aug 2025 20:54:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0acd8d0d/91b436eb.mp3" length="21257515" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu</p>

            <p><strong>Title:</strong><br>
            AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09889v1">http://arxiv.org/abs/2508.09889v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent systems (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL</title>
      <itunes:episode>1055</itunes:episode>
      <podcast:episode>1055</podcast:episode>
      <itunes:title>Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">335afc99-f318-4553-bfae-101dbf6711b3</guid>
      <link>https://share.transistor.fm/s/d14f9c15</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07976v2">http://arxiv.org/abs/2508.07976v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. &lt;=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07976v2">http://arxiv.org/abs/2508.07976v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. &lt;=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Aug 2025 20:58:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d14f9c15/32989639.mp3" length="22681918" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07976v2">http://arxiv.org/abs/2508.07976v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. &lt;=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Complex Logical Instruction Generation</title>
      <itunes:episode>1054</itunes:episode>
      <podcast:episode>1054</podcast:episode>
      <itunes:title>Complex Logical Instruction Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23743431-b731-4c31-a93f-c22cd8af9036</guid>
      <link>https://share.transistor.fm/s/320dfe38</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            Complex Logical Instruction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09125v1">http://arxiv.org/abs/2508.09125v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            Complex Logical Instruction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09125v1">http://arxiv.org/abs/2508.09125v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Aug 2025 20:57:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/320dfe38/c7eaa0e4.mp3" length="18378142" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1145</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song</p>

            <p><strong>Title:</strong><br>
            Complex Logical Instruction Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09125v1">http://arxiv.org/abs/2508.09125v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models</title>
      <itunes:episode>1053</itunes:episode>
      <podcast:episode>1053</podcast:episode>
      <itunes:title>Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ab6baf69-94f3-4e87-bbe9-6c125960e473</guid>
      <link>https://share.transistor.fm/s/33d46d35</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09138v1">http://arxiv.org/abs/2508.09138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge partway through the denoising process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09138v1">http://arxiv.org/abs/2508.09138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge partway through the denoising process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Aug 2025 20:57:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/33d46d35/a732ea59.mp3" length="22253916" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen</p>

            <p><strong>Title:</strong><br>
            Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.09138v1">http://arxiv.org/abs/2508.09138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge partway through the denoising process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches</title>
      <itunes:episode>1052</itunes:episode>
      <podcast:episode>1052</podcast:episode>
      <itunes:title>HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7cffbcfa-1324-4ba1-8301-a46872b6350d</guid>
      <link>https://share.transistor.fm/s/6bf23cbe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08088v1">http://arxiv.org/abs/2508.08088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both the local corpus and the Web. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08088v1">http://arxiv.org/abs/2508.08088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both the local corpus and the Web. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Aug 2025 20:56:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6bf23cbe/48773818.mp3" length="22544833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1405</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.IR, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Yan Yu, Jiehan Cheng, Qiang Ju, Jian Xie, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.08088v1">http://arxiv.org/abs/2508.08088v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have demonstrated strong mathematical and coding abilities, and deep search leverages their reasoning capabilities in challenging information retrieval tasks. Existing deep search works are generally limited to a single knowledge source, either local or the Web. However, enterprises often require private deep search systems that can leverage search tools over both the local corpus and the Web. Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer. Moreover, to prevent direct answer copying and error propagation, we design a knowledge refiner that filters out hallucinations and irrelevant evidence returned by low-level agents. Experiments show that HierSearch achieves better performance compared to flat RL, and outperforms various deep search and multi-source retrieval-augmented generation baselines in six benchmarks across general, finance, and medical domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability</title>
      <itunes:episode>1051</itunes:episode>
      <podcast:episode>1051</podcast:episode>
      <itunes:title>ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2e6b4596-3086-4fa8-bd12-fb4c4b8618eb</guid>
      <link>https://share.transistor.fm/s/3d7b81c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.IR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07050v1">http://arxiv.org/abs/2508.07050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker <strong>ReasonRank</strong> outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. <strong>Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/).</strong> Our code is available at https://github.com/8421BCD/ReasonRank.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.IR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07050v1">http://arxiv.org/abs/2508.07050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker <strong>ReasonRank</strong> outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. <strong>Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/).</strong> Our code is available at https://github.com/8421BCD/ReasonRank.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:16:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3d7b81c6/5a4ae6a0.mp3" length="22125176" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.IR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenhan Liu, Xinyu Ma, Weiwei Sun, Yutao Zhu, Yuchen Li, Dawei Yin, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07050v1">http://arxiv.org/abs/2508.07050v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning at test time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker <strong>ReasonRank</strong> outperforms existing baselines significantly and also achieves much lower latency than the pointwise reranker Rank1. <strong>Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard (https://brightbenchmark.github.io/).</strong> Our code is available at https://github.com/8421BCD/ReasonRank.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WideSearch: Benchmarking Agentic Broad Info-Seeking</title>
      <itunes:episode>1050</itunes:episode>
      <podcast:episode>1050</podcast:episode>
      <itunes:title>WideSearch: Benchmarking Agentic Broad Info-Seeking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c0422829-86e3-4521-a652-20017328391c</guid>
      <link>https://share.transistor.fm/s/3e4926c4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            WideSearch: Benchmarking Agentic Broad Info-Seeking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07999v1">http://arxiv.org/abs/2508.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            WideSearch: Benchmarking Agentic Broad Info-Seeking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07999v1">http://arxiv.org/abs/2508.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:16:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e4926c4/576c6288.mp3" length="21477741" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1339</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang</p>

            <p><strong>Title:</strong><br>
            WideSearch: Benchmarking Agentic Broad Info-Seeking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07999v1">http://arxiv.org/abs/2508.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation</title>
      <itunes:episode>1049</itunes:episode>
      <podcast:episode>1049</podcast:episode>
      <itunes:title>Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">40862c8f-7286-42e6-9f0d-54b4329df285</guid>
      <link>https://share.transistor.fm/s/8aa56c96</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07981v2">http://arxiv.org/abs/2508.07981v2</a></p>

            <p><strong>Abstract:</strong><br>
            Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07981v2">http://arxiv.org/abs/2508.07981v2</a></p>

            <p><strong>Abstract:</strong><br>
            Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:16:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8aa56c96/963bd7c8.mp3" length="19141790" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1193</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu</p>

            <p><strong>Title:</strong><br>
            Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07981v2">http://arxiv.org/abs/2508.07981v2</a></p>

            <p><strong>Abstract:</strong><br>
            Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems</title>
      <itunes:episode>1048</itunes:episode>
      <podcast:episode>1048</podcast:episode>
      <itunes:title>A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7bd8c435-980b-4d36-a06a-ab81bf4af13d</guid>
      <link>https://share.transistor.fm/s/4ba0865a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng</p>

            <p><strong>Title:</strong><br>
            A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07407v1">http://arxiv.org/abs/2508.07407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng</p>

            <p><strong>Title:</strong><br>
            A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07407v1">http://arxiv.org/abs/2508.07407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:15:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4ba0865a/20ee2d67.mp3" length="20139923" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, Zhaochun Ren, Nikos Aletras, Xi Wang, Han Zhou, Zaiqiao Meng</p>

            <p><strong>Title:</strong><br>
            A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07407v1">http://arxiv.org/abs/2508.07407v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models have sparked growing interest in AI agents capable of solving complex, real-world tasks. However, most existing agent systems rely on manually crafted configurations that remain static after deployment, limiting their ability to adapt to dynamic and evolving environments. To this end, recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This emerging direction lays the foundation for self-evolving AI agents, which bridge the static capabilities of foundation models with the continuous adaptability required by lifelong agentic systems. In this survey, we provide a comprehensive review of existing techniques for self-evolving agentic systems. Specifically, we first introduce a unified conceptual framework that abstracts the feedback loop underlying the design of self-evolving agentic systems. The framework highlights four key components: System Inputs, Agent System, Environment, and Optimisers, serving as a foundation for understanding and comparing different strategies. Based on this framework, we systematically review a wide range of self-evolving techniques that target different components of the agent system. We also investigate domain-specific evolution strategies developed for specialised fields such as biomedicine, programming, and finance, where optimisation objectives are tightly coupled with domain constraints. In addition, we provide a dedicated discussion on the evaluation, safety, and ethical considerations for self-evolving agentic systems, which are critical to ensuring their effectiveness and reliability. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents, laying the foundation for the development of more adaptive, autonomous, and lifelong agentic systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent</title>
      <itunes:episode>1047</itunes:episode>
      <podcast:episode>1047</podcast:episode>
      <itunes:title>BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c65c27dd-3d51-4c9d-a064-f9e8bcd29c6b</guid>
      <link>https://share.transistor.fm/s/d06ddcbc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin</p>

            <p><strong>Title:</strong><br>
            BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06600v1">http://arxiv.org/abs/2508.06600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: a lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin</p>

            <p><strong>Title:</strong><br>
            BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06600v1">http://arxiv.org/abs/2508.06600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: a lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:15:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d06ddcbc/5096f45e.mp3" length="20946134" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1305</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin</p>

            <p><strong>Title:</strong><br>
            BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06600v1">http://arxiv.org/abs/2508.06600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp rely on black-box live web search APIs and have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: a lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens</title>
      <itunes:episode>1046</itunes:episode>
      <podcast:episode>1046</podcast:episode>
      <itunes:title>SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05b68657-bc18-4666-bde8-06ebb5608157</guid>
      <link>https://share.transistor.fm/s/00f90565</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev</p>

            <p><strong>Title:</strong><br>
            SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.05305v1">http://arxiv.org/abs/2508.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev</p>

            <p><strong>Title:</strong><br>
            SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.05305v1">http://arxiv.org/abs/2508.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:15:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/00f90565/a679ccd6.mp3" length="20316692" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1266</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev</p>

            <p><strong>Title:</strong><br>
            SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.05305v1">http://arxiv.org/abs/2508.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization</title>
      <itunes:episode>1045</itunes:episode>
      <podcast:episode>1045</podcast:episode>
      <itunes:title>Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e8dd2f7f-cf54-410f-aede-e8ddd93e6422</guid>
      <link>https://share.transistor.fm/s/f1dd1b9b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07629v2">http://arxiv.org/abs/2508.07629v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult due to incomplete disclosure of training details. This report provides an in-depth analysis of our reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07629v2">http://arxiv.org/abs/2508.07629v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult due to incomplete disclosure of training details. This report provides an in-depth analysis of our reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:14:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f1dd1b9b/2338687d.mp3" length="23336035" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1455</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou</p>

            <p><strong>Title:</strong><br>
            Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07629v2">http://arxiv.org/abs/2508.07629v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although the community has already produced many excellent works on reasoning models, reproducing high-performance reasoning models remains difficult due to incomplete disclosure of training details. This report provides an in-depth analysis of our reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MolmoAct: Action Reasoning Models that can Reason in Space</title>
      <itunes:episode>1044</itunes:episode>
      <podcast:episode>1044</podcast:episode>
      <itunes:title>MolmoAct: Action Reasoning Models that can Reason in Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eebfac95-f9a5-4fb6-b752-7144ebe9df04</guid>
      <link>https://share.transistor.fm/s/5de0e3c5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            MolmoAct: Action Reasoning Models that can Reason in Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07917v1">http://arxiv.org/abs/2508.07917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            MolmoAct: Action Reasoning Models that can Reason in Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07917v1">http://arxiv.org/abs/2508.07917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Aug 2025 21:14:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5de0e3c5/44b73a3a.mp3" length="20010293" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1247</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, Ranjay Krishna</p>

            <p><strong>Title:</strong><br>
            MolmoAct: Action Reasoning Models that can Reason in Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.07917v1">http://arxiv.org/abs/2508.07917v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of vision-language-action models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning. Blogpost: https://allenai.org/blog/molmoact</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models</title>
      <itunes:episode>1043</itunes:episode>
      <podcast:episode>1043</podcast:episode>
      <itunes:title>GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1ece8661-ac82-4142-bcf9-88d707474dc8</guid>
      <link>https://share.transistor.fm/s/d87def40</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06471v1">http://arxiv.org/abs/2508.06471v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06471v1">http://arxiv.org/abs/2508.06471v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.</p>
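
            <p><strong>Code sketch:</strong><br>
            To make the total-versus-activated parameter distinction concrete, a rough, generic helper for a Mixture-of-Experts Transformer is sketched below; the layer sizes are invented for illustration and are not GLM-4.5's actual configuration.</p>

            <pre><code>
            def moe_param_counts(n_layers, d_model, d_ff, n_experts, top_k, shared_params):
                """Rough total vs. activated parameter counts for a Transformer with MoE FFNs.

                Hypothetical accounting: each expert is a two-matrix FFN; attention and
                embeddings are lumped into shared_params. Real models differ in detail.
                """
                expert_params = 2 * d_model * d_ff                  # one expert's FFN weights
                total = shared_params + n_layers * n_experts * expert_params
                activated = shared_params + n_layers * top_k * expert_params
                return total, activated

            # Made-up configuration, purely to show how a large total/activated gap arises.
            total, activated = moe_param_counts(
                n_layers=60, d_model=5120, d_ff=12288, n_experts=32, top_k=8,
                shared_params=15_000_000_000,
            )
            print(f"total:     {total / 1e9:.0f}B")
            print(f"activated: {activated / 1e9:.0f}B")
            </code></pre>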
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Aug 2025 19:57:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d87def40/05e22e0f.mp3" length="21164701" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1319</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.06471v1">http://arxiv.org/abs/2508.06471v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off</title>
      <itunes:episode>1042</itunes:episode>
      <podcast:episode>1042</podcast:episode>
      <itunes:title>Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">82359c99-ddb4-43d5-be4b-49379dd4f1df</guid>
      <link>https://share.transistor.fm/s/35895bb6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.GR, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungyong Lee, Jeong-gi Kwak</p>

            <p><strong>Title:</strong><br>
            Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04825v1">http://arxiv.org/abs/2508.04825v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.GR, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungyong Lee, Jeong-gi Kwak</p>

            <p><strong>Title:</strong><br>
            Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04825v1">http://arxiv.org/abs/2508.04825v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.</p>
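
            <p><strong>Code sketch:</strong><br>
            The abstract names attention temperature scaling as an inference-time technique for robustness to resolution or mask variation. Below is a generic sketch of temperature-scaled dot-product attention; the rule tying the temperature to the token-count ratio is an assumption for illustration, not Voost's published schedule.</p>

            <pre><code>
            import numpy as np

            def scaled_attention(q, k, v, temperature=1.0):
                """Dot-product attention with an extra temperature on the logits.

                A temperature below 1.0 sharpens the attention map, above 1.0 flattens it.
                """
                d = q.shape[-1]
                logits = q @ k.T / (np.sqrt(d) * temperature)
                logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
                weights = np.exp(logits)
                weights /= weights.sum(axis=-1, keepdims=True)
                return weights @ v

            # Hypothetical rule: tie the temperature to how much the test-time token count
            # deviates from the training resolution, so attention stays comparably peaked.
            def resolution_temperature(train_tokens, test_tokens):
                return np.sqrt(test_tokens / train_tokens)

            rng = np.random.default_rng(0)
            q, k, v = rng.normal(size=(3, 16, 32))
            out = scaled_attention(q, k, v, temperature=resolution_temperature(256, 1024))
            print(out.shape)
            </code></pre>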
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Aug 2025 19:57:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/35895bb6/ede06239.mp3" length="20083056" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1252</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.GR, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungyong Lee, Jeong-gi Kwak</p>

            <p><strong>Title:</strong><br>
            Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04825v1">http://arxiv.org/abs/2508.04825v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens</title>
      <itunes:episode>1041</itunes:episode>
      <podcast:episode>1041</podcast:episode>
      <itunes:title>Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2a3b6fb1-2216-41b3-904f-1ffd4bbe2128</guid>
      <link>https://share.transistor.fm/s/888b33b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 143 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01191v2">http://arxiv.org/abs/2508.01191v2</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 143 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01191v2">http://arxiv.org/abs/2508.01191v2</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:52:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/888b33b2/ba3dec8d.mp3" length="22039917" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 143 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01191v2">http://arxiv.org/abs/2508.01191v2</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning via a data distribution lens and investigate if CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Thus, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment to train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VeriGUI: Verifiable Long-Chain GUI Dataset</title>
      <itunes:episode>1040</itunes:episode>
      <podcast:episode>1040</podcast:episode>
      <itunes:title>VeriGUI: Verifiable Long-Chain GUI Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d1e07236-2497-4998-a93a-9769ac977f9a</guid>
      <link>https://share.transistor.fm/s/462ed6b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.HC</p>

            <p><strong>Authors:</strong><br>
            Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            VeriGUI: Verifiable Long-Chain GUI Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04026v1">http://arxiv.org/abs/2508.04026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.HC</p>

            <p><strong>Authors:</strong><br>
            Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            VeriGUI: Verifiable Long-Chain GUI Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04026v1">http://arxiv.org/abs/2508.04026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.</p>
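
            <p><strong>Code sketch:</strong><br>
            One way to picture a long-chain, subtask-verifiable task record is as a list of subtasks that each carry a resumable start state and their own programmatic check, so any subtask can serve as a starting point and partial progress can be credited. The field names and verifiers below are illustrative assumptions, not the actual VeriGUI schema.</p>

            <pre><code>
            from dataclasses import dataclass, field
            from typing import Callable

            @dataclass
            class Subtask:
                goal: str
                start_state: dict                      # environment snapshot to resume from
                verify: Callable[[dict], bool]         # subtask-level success check

            @dataclass
            class LongChainTask:
                instruction: str
                subtasks: list = field(default_factory=list)

                def score(self, final_states):
                    # Subtask-level verifiability: credit every verified subtask, not just the outcome.
                    passed = sum(1 for sub, state in zip(self.subtasks, final_states) if sub.verify(state))
                    return passed / len(self.subtasks)

            task = LongChainTask(
                instruction="Collect quarterly revenue figures and file a summary report",
                subtasks=[
                    Subtask("Open the finance dashboard", {"url": "about:blank"},
                            lambda s: s.get("url") == "finance/dashboard"),
                    Subtask("Export Q1-Q4 revenue as CSV", {"url": "finance/dashboard"},
                            lambda s: s.get("downloaded") == "revenue.csv"),
                ],
            )
            print(task.score([{"url": "finance/dashboard"}, {"downloaded": "summary.pdf"}]))  # 0.5
            </code></pre>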
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:52:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/462ed6b9/99355b3c.mp3" length="24616187" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1535</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.HC</p>

            <p><strong>Authors:</strong><br>
            Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Wendong Fan, Ge Zhang, Jiajun Shi, Weihao Xuan, Jiaxing Huang, Shuang Luo, Fang Wu, Heli Qi, Qingcheng Zeng, Ziqi Ren, Jialiang Gao, Jindi Lv, Junjie Wang, Aosong Feng, Heng Zhou, Wangchunshu Zhou, Zhenfei Yin, Wenlong Zhang, Guohao Li, Wenhao Yu, Irene Li, Lei Ma, Lei Bai, Qunshu Lin, Mingli Song, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            VeriGUI: Verifiable Long-Chain GUI Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04026v1">http://arxiv.org/abs/2508.04026v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have delved into constructing autonomous agents capable of performing complex Graphical User Interface (GUI)-based computer tasks, with the potential to revolutionize human-computer interaction. Despite encouraging results, existing efforts mainly focus on short-term interactions and rely on outcome-only verification, thereby limiting their scalability in real-world GUI applications that demand long-horizon task decomposition and execution. In this work, we introduce VeriGUI, a novel verifiable long-chain GUI dataset designed to facilitate the development and evaluation of generalist GUI agents operating in realistic computer environments. Our dataset emphasizes two critical dimensions: (1) long-chain complexity, with tasks decomposed into a sequence of interdependent subtasks spanning hundreds of steps, explicitly designed to allow any subtask to serve as a valid starting point; and (2) subtask-level verifiability, which enables diverse exploration strategies within each subtask, while ensuring that each subtask-level goal remains verifiable and consistent. The dataset consists of GUI task trajectories across both desktop and web, annotated by human experts. Extensive experiments on VeriGUI using various agents with different foundation models reveal significant performance gaps in handling long-horizon tasks, highlighting the need for more robust planning and decision-making capabilities in GUI agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Agents: Building Effective Agents While Reducing Cost</title>
      <itunes:episode>1039</itunes:episode>
      <podcast:episode>1039</podcast:episode>
      <itunes:title>Efficient Agents: Building Effective Agents While Reducing Cost</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">068eec98-3e6a-408b-8532-a92fe6195859</guid>
      <link>https://share.transistor.fm/s/9c08f683</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Efficient Agents: Building Effective Agents While Reducing Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02694v1">http://arxiv.org/abs/2508.02694v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework whose complexity is matched to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Efficient Agents: Building Effective Agents While Reducing Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02694v1">http://arxiv.org/abs/2508.02694v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework whose complexity is matched to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.</p>
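
            <p><strong>Code sketch:</strong><br>
            Cost-of-pass, as commonly defined, is the expected spend to obtain one successful completion: average cost per attempt divided by the success rate. The helper below illustrates the idea with made-up numbers and does not reproduce the paper's exact accounting.</p>

            <pre><code>
            def cost_of_pass(avg_cost_per_attempt, success_rate):
                """Expected cost to obtain one successful completion.

                If a run costs c on average and succeeds with probability p, you expect
                to pay c / p per success (assuming independent retries).
                """
                if success_rate == 0:
                    return float("inf")
                return avg_cost_per_attempt / success_rate

            # Made-up example: a cheaper framework can win on cost-of-pass even with a
            # slightly lower success rate.
            baseline = cost_of_pass(avg_cost_per_attempt=0.40, success_rate=0.60)   # ~0.667
            efficient = cost_of_pass(avg_cost_per_attempt=0.23, success_rate=0.58)  # ~0.397
            improvement = 1 - efficient / baseline
            print(f"baseline:  ${baseline:.3f} per solved task")
            print(f"efficient: ${efficient:.3f} per solved task")
            print(f"cost-of-pass improvement: {improvement:.1%}")
            </code></pre>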
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:51:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9c08f683/590991fe.mp3" length="20666494" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Efficient Agents: Building Effective Agents While Reducing Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02694v1">http://arxiv.org/abs/2508.02694v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework whose complexity is matched to task requirements. Efficient Agents retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from $0.398 to $0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience</title>
      <itunes:episode>1038</itunes:episode>
      <podcast:episode>1038</podcast:episode>
      <itunes:title>SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d29f5e08-ede6-4f50-9252-45fb88fd8813</guid>
      <link>https://share.transistor.fm/s/ca26a099</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04700v1">http://arxiv.org/abs/2508.04700v1</a></p>

            <p><strong>Abstract:</strong><br>
            Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04700v1">http://arxiv.org/abs/2508.04700v1</a></p>

            <p><strong>Abstract:</strong><br>
            Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.</p>
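
            <p><strong>Code sketch:</strong><br>
            Group Relative Policy Optimization scores several rollouts of the same task and converts their rewards into group-relative advantages (mean-centred, scaled by the group's standard deviation), avoiding a learned critic. A minimal sketch with made-up rewards follows; how SEAgent combines this with adversarial imitation of failures is only gestured at in the comments.</p>

            <pre><code>
            import statistics

            def grpo_advantages(group_rewards, eps=1e-6):
                """Group-relative advantages: compare each rollout to its own group.

                Rollouts for the same task are scored, then each reward is centred on the
                group mean and scaled by the group standard deviation, so no learned value
                function (critic) is needed.
                """
                mean = statistics.fmean(group_rewards)
                std = statistics.pstdev(group_rewards)
                return [(r - mean) / (std + eps) for r in group_rewards]

            # Four rollouts of one auto-generated task, scored by a step-wise judge
            # (e.g. a world-state model assigning partial credit); numbers are made up.
            rewards = [1.0, 0.0, 0.5, 0.0]
            print(grpo_advantages(rewards))
            # Positive advantages reinforce the successful rollouts; failed rollouts could
            # instead feed an adversarial imitation term that pushes the policy away from them.
            </code></pre>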
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:51:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca26a099/95d878e3.mp3" length="23032580" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04700v1">http://arxiv.org/abs/2508.04700v1</a></p>

            <p><strong>Abstract:</strong><br>
            Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning</title>
      <itunes:episode>1037</itunes:episode>
      <podcast:episode>1037</podcast:episode>
      <itunes:title>Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0268f6a3-fe16-4c5a-9fa8-aa58c60ae872</guid>
      <link>https://share.transistor.fm/s/58bca535</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03501v1">http://arxiv.org/abs/2508.03501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation.   To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03501v1">http://arxiv.org/abs/2508.03501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation.   To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.</p>
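
            <p><strong>Code sketch:</strong><br>
            A key point above is that software-engineering episodes are genuinely multi-turn: environment observations interleave with agent actions, and only tokens the agent generated should receive policy-gradient updates. The sketch below masks observation tokens inside a clipped token-level loss; the asymmetric clip range and the masking details are generic assumptions, not the paper's exact DAPO modification.</p>

            <pre><code>
            import numpy as np

            def action_token_loss(logprobs, old_logprobs, advantages, action_mask,
                                  clip_low=0.8, clip_high=1.2):
                """Clipped policy-gradient loss over a multi-turn trajectory.

                All arrays are per-token over the full context (instructions, tool output,
                observations, agent actions). action_mask is 1 only on tokens the agent
                generated, so feedback from the stateful environment is never treated as
                something the policy chose.
                """
                ratio = np.exp(logprobs - old_logprobs)
                clipped = np.clip(ratio, clip_low, clip_high)
                per_token = np.minimum(ratio * advantages, clipped * advantages)
                return -(per_token * action_mask).sum() / max(action_mask.sum(), 1)

            # Toy trajectory: 3 observation tokens, then 3 action tokens sharing one
            # trajectory-level advantage (made-up numbers).
            logprobs     = np.array([-1.0, -2.0, -0.5, -0.7, -0.9, -1.1])
            old_logprobs = np.array([-1.0, -2.0, -0.5, -0.8, -1.0, -1.2])
            advantages   = np.full(6, 0.6)
            action_mask  = np.array([0, 0, 0, 1, 1, 1])
            print(action_token_loss(logprobs, old_logprobs, advantages, action_mask))
            </code></pre>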
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:51:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/58bca535/e44ff98b.mp3" length="21253335" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03501v1">http://arxiv.org/abs/2508.03501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation.   To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success</title>
      <itunes:episode>1036</itunes:episode>
      <podcast:episode>1036</podcast:episode>
      <itunes:title>Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">814bdaf8-45ec-4e4a-a09b-3f85bf71c219</guid>
      <link>https://share.transistor.fm/s/a8f8208a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04280v1">http://arxiv.org/abs/2508.04280v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04280v1">http://arxiv.org/abs/2508.04280v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:50:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a8f8208a/5a6af239.mp3" length="21187319" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            George Bredis, Stanislav Dereka, Viacheslav Sinii, Ruslan Rakhimov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.04280v1">http://arxiv.org/abs/2508.04280v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive multimodal agents must convert raw visual observations into coherent sequences of language-conditioned actions -- a capability that current vision-language models (VLMs) still lack. Earlier reinforcement-learning (RL) efforts could, in principle, endow VLMs with such skills, but they have seldom tested whether the learned behaviours generalize beyond their training simulators, and they depend either on brittle hyperparameter tuning or on dense-reward environments with low state variability. We introduce Vision-Language Decoupled Actor-Critic (VL-DAC), a lightweight, hyperparameter-free RL algorithm. VL-DAC applies PPO updates to action tokens while learning value only at the environment-step level: an arrangement, to our knowledge, not previously explored for large VLMs or LLMs. This simple decoupling removes unstable weighting terms and yields faster, more reliable convergence. Training a single VLM with VL-DAC in one inexpensive simulator at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) already produces policies that generalize widely: +50% relative on BALROG (game-centric agentic control), +5% relative on the hardest part of VSI-Bench (spatial planning), and +2% on VisualWebBench (web navigation), all without degrading general image understanding accuracy. These results provide the first evidence that a simple RL algorithm can train VLMs entirely in cheap synthetic worlds while delivering measurable gains on real-image agentic, spatial-reasoning, and web-navigation benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent Lightning: Train ANY AI Agents with Reinforcement Learning</title>
      <itunes:episode>1035</itunes:episode>
      <podcast:episode>1035</podcast:episode>
      <itunes:title>Agent Lightning: Train ANY AI Agents with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">30bad9d7-6f78-4f0f-a996-59ff6c96e7e3</guid>
      <link>https://share.transistor.fm/s/7a4fcdcf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Agent Lightning: Train ANY AI Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03680v1">http://arxiv.org/abs/2508.03680v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, and AutoGen, or building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agent into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.</p>
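
            <p><strong>Editor's sketch (illustrative, not Agent Lightning's actual interface):</strong><br>
            One way to read "formulating agent execution as a Markov decision process and decomposing trajectories into training transitions" is the data shape below. The uniform credit assignment is a placeholder assumption, not LightningRL's hierarchical scheme; all names are hypothetical.</p>

            <pre><code>
# Hypothetical data shape for turning an agent run into RL training transitions.
# The uniform credit assignment below is a stand-in, not LightningRL's algorithm.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transition:
    state: str     # prompt/context the LLM saw at this call
    action: str    # the LLM's output at this call
    reward: float  # credit assigned to this call

def decompose(llm_calls: List[Dict[str, str]], final_reward: float) -> List[Transition]:
    # naive scheme: spread the episode's final reward uniformly over every LLM call
    return [Transition(c["input"], c["output"], final_reward / len(llm_calls))
            for c in llm_calls]

trace = [{"input": "user asks for a SQL query", "output": "SELECT ..."},
         {"input": "db returns an error", "output": "SELECT ... (fixed)"}]
print(decompose(trace, final_reward=1.0))
            </code></pre>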
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Agent Lightning: Train ANY AI Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03680v1">http://arxiv.org/abs/2508.03680v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, and AutoGen, or building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agent into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Aug 2025 20:50:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7a4fcdcf/250e636f.mp3" length="22555670" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1406</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Agent Lightning: Train ANY AI Agents with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.03680v1">http://arxiv.org/abs/2508.03680v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, and AutoGen, or building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agent into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen-Image Technical Report</title>
      <itunes:episode>1034</itunes:episode>
      <podcast:episode>1034</podcast:episode>
      <itunes:title>Qwen-Image Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">30d933c4-0145-4c75-a151-18d51a15e6f1</guid>
      <link>https://share.transistor.fm/s/f771b97d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu</p>

            <p><strong>Title:</strong><br>
            Qwen-Image Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02324v1">http://arxiv.org/abs/2508.02324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu</p>

            <p><strong>Title:</strong><br>
            Qwen-Image Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02324v1">http://arxiv.org/abs/2508.02324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Aug 2025 20:36:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f771b97d/c25d5879.mp3" length="19738171" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1230</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu</p>

            <p><strong>Title:</strong><br>
            Qwen-Image Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02324v1">http://arxiv.org/abs/2508.02324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension</title>
      <itunes:episode>1033</itunes:episode>
      <podcast:episode>1033</podcast:episode>
      <itunes:title>SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f5356dc4-ed81-4dea-b21a-d228ff0ba994</guid>
      <link>https://share.transistor.fm/s/ea35d281</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01959v1">http://arxiv.org/abs/2508.01959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model, built on BGE-M3 with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.</p>
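
            <p><strong>Editor's sketch (illustrative only):</strong><br>
            One way to read "situating a chunk's meaning within its context" is to encode the chunk together with its surrounding window and then pool only the chunk's own token vectors. The hash-based token vectors and local-averaging "contextualizer" below are toy stand-ins for a real contextual encoder such as BGE-M3; everything here is an assumption for illustration.</p>

            <pre><code>
# Toy illustration of a "situated" chunk embedding: encode the chunk together with its
# surrounding context, then pool only the chunk's positions.
import numpy as np

def toy_token_vectors(tokens, dim=16):
    # deterministic per-token vectors so the example runs without a model
    return np.stack([np.random.default_rng(abs(hash(t)) % (2**32)).normal(size=dim)
                     for t in tokens])

def contextualize(vecs, window=3):
    # toy contextualization: mix each position with a local average of its neighbours
    out = np.zeros_like(vecs)
    for i in range(len(vecs)):
        lo, hi = max(0, i - window), min(len(vecs), i + window + 1)
        out[i] = 0.5 * vecs[i] + 0.5 * vecs[lo:hi].mean(axis=0)
    return out

def situated_embedding(context_tokens, chunk_start, chunk_end):
    vecs = contextualize(toy_token_vectors(context_tokens))  # chunk encoded *with* context
    emb = vecs[chunk_start:chunk_end].mean(axis=0)            # pool only the chunk's tokens
    return emb / np.linalg.norm(emb)

doc = "the detective returned to the old house where the letter was hidden".split()
print(situated_embedding(doc, chunk_start=4, chunk_end=8))  # embeds "the old house where"
            </code></pre>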
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01959v1">http://arxiv.org/abs/2508.01959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model, built on BGE-M3 with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Aug 2025 20:36:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ea35d281/c66960d3.mp3" length="22746719" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1418</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu</p>

            <p><strong>Title:</strong><br>
            SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01959v1">http://arxiv.org/abs/2508.01959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance -- i.e., situating a chunk's meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model, built on BGE-M3 with only 1B parameters, substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CellForge: Agentic Design of Virtual Cell Models</title>
      <itunes:episode>1032</itunes:episode>
      <podcast:episode>1032</podcast:episode>
      <itunes:title>CellForge: Agentic Design of Virtual Cell Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6b3cda71-b22c-43b0-b10a-e8c86a2fc2e8</guid>
      <link>https://share.transistor.fm/s/0c407142</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL, q-bio.QM</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein</p>

            <p><strong>Title:</strong><br>
            CellForge: Agentic Design of Virtual Cell Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02276v1">http://arxiv.org/abs/2508.02276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework to transform the provided biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training and inference of virtual cell models. The framework integrates three core modules: Task Analysis for characterizing the provided dataset and retrieving relevant literature; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution for automated code generation. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and they exchange solutions collaboratively until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than addressing the modeling challenge directly. Our code is publicly available at https://github.com/gersteinlab/CellForge.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL, q-bio.QM</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein</p>

            <p><strong>Title:</strong><br>
            CellForge: Agentic Design of Virtual Cell Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02276v1">http://arxiv.org/abs/2508.02276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework to transform the provided biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training and inference of virtual cell models. The framework integrates three core modules: Task Analysis for characterizing the provided dataset and retrieving relevant literature; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution for automated code generation. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and they exchange solutions collaboratively until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than addressing the modeling challenge directly. Our code is publicly available at https://github.com/gersteinlab/CellForge.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Aug 2025 20:36:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c407142/eb182282.mp3" length="20444543" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1274</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL, q-bio.QM</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein</p>

            <p><strong>Title:</strong><br>
            CellForge: Agentic Design of Virtual Cell Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02276v1">http://arxiv.org/abs/2508.02276v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework to transform the provided biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training and inference of virtual cell models. The framework integrates three core modules: Task Analysis for characterizing the provided dataset and retrieving relevant literature; Method Design, where specialized agents collaboratively develop optimized modeling strategies; and Experiment Execution for automated code generation. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and they exchange solutions collaboratively until they reach a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than addressing the modeling challenge directly. Our code is publicly available at https://github.com/gersteinlab/CellForge.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following</title>
      <itunes:episode>1031</itunes:episode>
      <podcast:episode>1031</podcast:episode>
      <itunes:title>Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba4be4eb-5cde-4cff-b41a-54200d3fb35b</guid>
      <link>https://share.transistor.fm/s/0fe663ea</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu</p>

            <p><strong>Title:</strong><br>
            Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02150v1">http://arxiv.org/abs/2508.02150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu</p>

            <p><strong>Title:</strong><br>
            Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02150v1">http://arxiv.org/abs/2508.02150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Aug 2025 20:35:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0fe663ea/2363807d.mp3" length="17961921" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1119</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu</p>

            <p><strong>Title:</strong><br>
            Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.02150v1">http://arxiv.org/abs/2508.02150v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhance instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report</title>
      <itunes:episode>1030</itunes:episode>
      <podcast:episode>1030</podcast:episode>
      <itunes:title>Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">39205dba-11ac-4dae-92be-3b5ce0abdf29</guid>
      <link>https://share.transistor.fm/s/650bc603</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sajana Weerawardhena, Paul Kassianik, Blaine Nelson, Baturay Saglam, Anu Vellore, Aman Priyanshu, Supriti Vijay, Massimo Aufiero, Arthur Goldblatt, Fraser Burch, Ed Li, Jianliang He, Dhruv Kedia, Kojin Oshiba, Zhouran Yang, Yaron Singer, Amin Karbasi</p>

            <p><strong>Title:</strong><br>
            Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01059v1">http://arxiv.org/abs/2508.01059v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sajana Weerawardhena, Paul Kassianik, Blaine Nelson, Baturay Saglam, Anu Vellore, Aman Priyanshu, Supriti Vijay, Massimo Aufiero, Arthur Goldblatt, Fraser Burch, Ed Li, Jianliang He, Dhruv Kedia, Kojin Oshiba, Zhouran Yang, Yaron Singer, Amin Karbasi</p>

            <p><strong>Title:</strong><br>
            Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01059v1">http://arxiv.org/abs/2508.01059v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Aug 2025 20:35:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/650bc603/ba598b69.mp3" length="20867114" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sajana Weerawardhena, Paul Kassianik, Blaine Nelson, Baturay Saglam, Anu Vellore, Aman Priyanshu, Supriti Vijay, Massimo Aufiero, Arthur Goldblatt, Fraser Burch, Ed Li, Jianliang He, Dhruv Kedia, Kojin Oshiba, Zhouran Yang, Yaron Singer, Amin Karbasi</p>

            <p><strong>Title:</strong><br>
            Llama-3.1-FoundationAI-SecurityLLM-8B-Instruct Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.01059v1">http://arxiv.org/abs/2508.01059v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable success across many domains, yet their integration into cybersecurity applications remains limited due to a lack of general-purpose cybersecurity data, representational complexity, and safety and regulatory concerns. To address this gap, we previously introduced Foundation-Sec-8B, a cybersecurity-focused LLM suitable for fine-tuning on downstream tasks. That model, however, was not designed for chat-style interactions or instruction-following. In this report, we release Foundation-Sec-8B-Instruct: a model specifically trained for general-purpose cybersecurity dialogue. Built on Foundation-Sec-8B, it combines domain-specific knowledge with instruction-following, conversational capabilities, and alignment with human preferences to produce high-quality, relevant responses. Comprehensive evaluations show that Foundation-Sec-8B-Instruct outperforms Llama 3.1-8B-Instruct on a range of cybersecurity tasks while matching its instruction-following performance. It is also competitive with GPT-4o-mini on cyber threat intelligence and instruction-following tasks. We envision Foundation-Sec-8B-Instruct becoming an indispensable assistant in the daily workflows of cybersecurity professionals. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Instruct.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models</title>
      <itunes:episode>1029</itunes:episode>
      <podcast:episode>1029</podcast:episode>
      <itunes:title>Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba45b61f-5ac3-480e-a257-9a6370aefda4</guid>
      <link>https://share.transistor.fm/s/92e7fae8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00819v1">http://arxiv.org/abs/2508.00819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.</p>
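
            <p><strong>Editor's sketch (illustrative only):</strong><br>
            A toy rendering of the two-phase control flow described above. The target length and per-position confidences would come from the diffusion LLM itself; here they are passed in as stand-in values, so every name below is a hypothetical assumption rather than the DAEDAL implementation.</p>

            <pre><code>
# Toy control flow for the two phases described in the abstract. `adequate_length`
# and `confidences` stand in for signals a real diffusion LLM would provide.
MASK = "[MASK]"

def expand_before_denoising(adequate_length, init_len=8, step=8, max_len=64):
    # phase 1: start from a short canvas and grow it until the model-provided
    # target length is reached or the length budget runs out
    seq = [MASK] * init_len
    while adequate_length > len(seq) and max_len > len(seq):
        seq.extend([MASK] * step)
    return seq

def expand_during_denoising(seq, confidences, floor=0.3):
    # phase 2: wherever the model is very unsure about a masked position, insert an
    # extra mask token so that region has room to grow before the next denoising pass
    out = []
    for tok, conf in zip(seq, confidences):
        out.append(tok)
        if tok == MASK and floor > conf:
            out.append(MASK)
    return out

canvas = expand_before_denoising(adequate_length=20)
print(len(canvas))                                                # 24 after phase-1 growth
print(len(expand_during_denoising(canvas, [0.1] * len(canvas))))  # low-confidence regions widened
            </code></pre>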
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00819v1">http://arxiv.org/abs/2508.00819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Aug 2025 20:22:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92e7fae8/9c066ddc.mp3" length="18895613" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1177</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00819v1">http://arxiv.org/abs/2508.00819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</title>
      <itunes:episode>1028</itunes:episode>
      <podcast:episode>1028</podcast:episode>
      <itunes:title>Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">239fdd91-0759-4e84-ae4b-471e4c28d481</guid>
      <link>https://share.transistor.fm/s/3c3a4266</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00414v1">http://arxiv.org/abs/2508.00414v1</a></p>

            <p><strong>Abstract:</strong><br>
            General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present <strong>Cognitive Kernel-Pro</strong>, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00414v1">http://arxiv.org/abs/2508.00414v1</a></p>

            <p><strong>Abstract:</strong><br>
            General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present <strong>Cognitive Kernel-Pro</strong>, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Aug 2025 20:21:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3c3a4266/051eb0db.mp3" length="20164557" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, Hongming Zhang, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2508.00414v1">http://arxiv.org/abs/2508.00414v1</a></p>

            <p><strong>Abstract:</strong><br>
            General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present <strong>Cognitive Kernel-Pro</strong>, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PixNerd: Pixel Neural Field Diffusion</title>
      <itunes:episode>1027</itunes:episode>
      <podcast:episode>1027</podcast:episode>
      <itunes:title>PixNerd: Pixel Neural Field Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fea97526-e5b6-4a8a-8309-c625dbc49b3c</guid>
      <link>https://share.transistor.fm/s/dfecb112</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            PixNerd: Pixel Neural Field Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23268v2">http://arxiv.org/abs/2507.23268v2</a></p>

            <p><strong>Abstract:</strong><br>
            The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixelNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            PixNerd: Pixel Neural Field Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23268v2">http://arxiv.org/abs/2507.23268v2</a></p>

            <p><strong>Abstract:</strong><br>
            The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixelNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Aug 2025 20:21:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dfecb112/3e90dd5b.mp3" length="22801403" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            PixNerd: Pixel Neural Field Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23268v2">http://arxiv.org/abs/2507.23268v2</a></p>

            <p><strong>Abstract:</strong><br>
            The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixelNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving</title>
      <itunes:episode>1026</itunes:episode>
      <podcast:episode>1026</podcast:episode>
      <itunes:title>Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">972754b7-b28a-4150-8c94-3a1848ca6685</guid>
      <link>https://share.transistor.fm/s/875b0f27</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23726v1">http://arxiv.org/abs/2507.23726v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose <strong>Seed-Prover</strong>, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine <strong>Seed-Geometry</strong>, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23726v1">http://arxiv.org/abs/2507.23726v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose <strong>Seed-Prover</strong>, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine <strong>Seed-Geometry</strong>, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 01 Aug 2025 20:04:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/875b0f27/dde9aaea.mp3" length="20457100" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu</p>

            <p><strong>Title:</strong><br>
            Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23726v1">http://arxiv.org/abs/2507.23726v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose <strong>Seed-Prover</strong>, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine <strong>Seed-Geometry</strong>, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Phi-Ground Tech Report: Advancing Perception in GUI Grounding</title>
      <itunes:episode>1025</itunes:episode>
      <podcast:episode>1025</podcast:episode>
      <itunes:title>Phi-Ground Tech Report: Advancing Perception in GUI Grounding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c0543350-82d9-44e9-918b-4efc7746e4fd</guid>
      <link>https://share.transistor.fm/s/ac90dcf5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo</p>

            <p><strong>Title:</strong><br>
            Phi-Ground Tech Report: Advancing Perception in GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23779v1">http://arxiv.org/abs/2507.23779v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from <em>"Iron Man"</em>, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the <strong>Phi-Ground</strong> model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of <strong><em>43.2</em></strong> on ScreenSpot-pro and <strong><em>27.2</em></strong> on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: <a href="https://zhangmiaosen2000.github.io/Phi-Ground/">https://zhangmiaosen2000.github.io/Phi-Ground/</a></p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo</p>

            <p><strong>Title:</strong><br>
            Phi-Ground Tech Report: Advancing Perception in GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23779v1">http://arxiv.org/abs/2507.23779v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from <em>"Iron Man"</em>, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the <strong>Phi-Ground</strong> model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of <strong><em>43.2</em></strong> on ScreenSpot-pro and <strong><em>27.2</em></strong> on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: <a href="https://zhangmiaosen2000.github.io/Phi-Ground/">https://zhangmiaosen2000.github.io/Phi-Ground/</a></p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 01 Aug 2025 20:04:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ac90dcf5/0551819a.mp3" length="20635145" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1286</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo</p>

            <p><strong>Title:</strong><br>
            Phi-Ground Tech Report: Advancing Perception in GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.23779v1">http://arxiv.org/abs/2507.23779v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from <em>"Iron Man"</em>, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment, as a single misclick can result in unacceptable consequences. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the <strong>Phi-Ground</strong> model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of <strong><em>43.2</em></strong> on ScreenSpot-pro and <strong><em>27.2</em></strong> on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: <a href="https://zhangmiaosen2000.github.io/Phi-Ground/">https://zhangmiaosen2000.github.io/Phi-Ground/</a></p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents</title>
      <itunes:episode>1024</itunes:episode>
      <podcast:episode>1024</podcast:episode>
      <itunes:title>ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f74b0c43-d36c-4012-abf1-9261872e941e</guid>
      <link>https://share.transistor.fm/s/69b15837</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22827v1">http://arxiv.org/abs/2507.22827v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22827v1">http://arxiv.org/abs/2507.22827v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 31 Jul 2025 20:46:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69b15837/c87472ae.mp3" length="19832287" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1236</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22827v1">http://arxiv.org/abs/2507.22827v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BANG: Dividing 3D Assets via Generative Exploded Dynamics</title>
      <itunes:episode>1023</itunes:episode>
      <podcast:episode>1023</podcast:episode>
      <itunes:title>BANG: Dividing 3D Assets via Generative Exploded Dynamics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ad2e49de-ca01-4535-94df-5fe0b6c945c9</guid>
      <link>https://share.transistor.fm/s/1e81aa2e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR</p>

            <p><strong>Authors:</strong><br>
            Longwen Zhang, Qixuan Zhang, Haoran Jiang, Yinuo Bai, Wei Yang, Lan Xu, Jingyi Yu</p>

            <p><strong>Title:</strong><br>
            BANG: Dividing 3D Assets via Generative Exploded Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21493v1">http://arxiv.org/abs/2507.21493v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D creation has always been a unique human strength, driven by our ability to deconstruct and reassemble objects using our eyes, mind and hand. However, current 3D design tools struggle to replicate this natural process, requiring considerable artistic expertise and manual labor. This paper introduces BANG, a novel generative approach that bridges 3D generation and reasoning, allowing for intuitive and flexible part-level decomposition of 3D objects. At the heart of BANG is "Generative Exploded Dynamics", which creates a smooth sequence of exploded states for an input geometry, progressively separating parts while preserving their geometric and semantic coherence.   BANG utilizes a pre-trained large-scale latent diffusion model, fine-tuned for exploded dynamics with a lightweight exploded view adapter, allowing precise control over the decomposition process. It also incorporates a temporal attention module to ensure smooth transitions and consistency across time. BANG enhances control with spatial prompts, such as bounding boxes and surface regions, enabling users to specify which parts to decompose and how. This interaction can be extended with multimodal models like GPT-4, enabling 2D-to-3D manipulations for more intuitive and creative workflows.   The capabilities of BANG extend to generating detailed part-level geometry, associating parts with functional descriptions, and facilitating component-aware 3D creation and manufacturing workflows. Additionally, BANG offers applications in 3D printing, where separable parts are generated for easy printing and reassembly. In essence, BANG enables seamless transformation from imaginative concepts to detailed 3D assets, offering a new perspective on creation that resonates with human intuition.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR</p>

            <p><strong>Authors:</strong><br>
            Longwen Zhang, Qixuan Zhang, Haoran Jiang, Yinuo Bai, Wei Yang, Lan Xu, Jingyi Yu</p>

            <p><strong>Title:</strong><br>
            BANG: Dividing 3D Assets via Generative Exploded Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21493v1">http://arxiv.org/abs/2507.21493v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D creation has always been a unique human strength, driven by our ability to deconstruct and reassemble objects using our eyes, mind and hand. However, current 3D design tools struggle to replicate this natural process, requiring considerable artistic expertise and manual labor. This paper introduces BANG, a novel generative approach that bridges 3D generation and reasoning, allowing for intuitive and flexible part-level decomposition of 3D objects. At the heart of BANG is "Generative Exploded Dynamics", which creates a smooth sequence of exploded states for an input geometry, progressively separating parts while preserving their geometric and semantic coherence.   BANG utilizes a pre-trained large-scale latent diffusion model, fine-tuned for exploded dynamics with a lightweight exploded view adapter, allowing precise control over the decomposition process. It also incorporates a temporal attention module to ensure smooth transitions and consistency across time. BANG enhances control with spatial prompts, such as bounding boxes and surface regions, enabling users to specify which parts to decompose and how. This interaction can be extended with multimodal models like GPT-4, enabling 2D-to-3D manipulations for more intuitive and creative workflows.   The capabilities of BANG extend to generating detailed part-level geometry, associating parts with functional descriptions, and facilitating component-aware 3D creation and manufacturing workflows. Additionally, BANG offers applications in 3D printing, where separable parts are generated for easy printing and reassembly. In essence, BANG enables seamless transformation from imaginative concepts to detailed 3D assets, offering a new perspective on creation that resonates with human intuition.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 31 Jul 2025 20:46:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1e81aa2e/994ccb3c.mp3" length="19772473" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR</p>

            <p><strong>Authors:</strong><br>
            Longwen Zhang, Qixuan Zhang, Haoran Jiang, Yinuo Bai, Wei Yang, Lan Xu, Jingyi Yu</p>

            <p><strong>Title:</strong><br>
            BANG: Dividing 3D Assets via Generative Exploded Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21493v1">http://arxiv.org/abs/2507.21493v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D creation has always been a unique human strength, driven by our ability to deconstruct and reassemble objects using our eyes, mind and hand. However, current 3D design tools struggle to replicate this natural process, requiring considerable artistic expertise and manual labor. This paper introduces BANG, a novel generative approach that bridges 3D generation and reasoning, allowing for intuitive and flexible part-level decomposition of 3D objects. At the heart of BANG is "Generative Exploded Dynamics", which creates a smooth sequence of exploded states for an input geometry, progressively separating parts while preserving their geometric and semantic coherence.   BANG utilizes a pre-trained large-scale latent diffusion model, fine-tuned for exploded dynamics with a lightweight exploded view adapter, allowing precise control over the decomposition process. It also incorporates a temporal attention module to ensure smooth transitions and consistency across time. BANG enhances control with spatial prompts, such as bounding boxes and surface regions, enabling users to specify which parts to decompose and how. This interaction can be extended with multimodal models like GPT-4, enabling 2D-to-3D manipulations for more intuitive and creative workflows.   The capabilities of BANG extend to generating detailed part-level geometry, associating parts with functional descriptions, and facilitating component-aware 3D creation and manufacturing workflows. Additionally, BANG offers applications in 3D printing, where separable parts are generated for easy printing and reassembly. In essence, BANG enables seamless transformation from imaginative concepts to detailed 3D assets, offering a new perspective on creation that resonates with human intuition.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning</title>
      <itunes:episode>1022</itunes:episode>
      <podcast:episode>1022</podcast:episode>
      <itunes:title>VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8568e659-3ea9-40b3-b294-24b3f5b7f281</guid>
      <link>https://share.transistor.fm/s/169e648b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong</p>

            <p><strong>Title:</strong><br>
            VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22607v2">http://arxiv.org/abs/2507.22607v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong</p>

            <p><strong>Title:</strong><br>
            VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22607v2">http://arxiv.org/abs/2507.22607v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 31 Jul 2025 20:45:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/169e648b/d8416b65.mp3" length="21701388" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong</p>

            <p><strong>Title:</strong><br>
            VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22607v2">http://arxiv.org/abs/2507.22607v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels</title>
      <itunes:episode>1021</itunes:episode>
      <podcast:episode>1021</podcast:episode>
      <itunes:title>HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55d5d34a-d18e-4044-9a22-ad517f34b424</guid>
      <link>https://share.transistor.fm/s/42cadfca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21809v1">http://arxiv.org/abs/2507.21809v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21809v1">http://arxiv.org/abs/2507.21809v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Jul 2025 20:19:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/42cadfca/b18a1a61.mp3" length="23107828" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21809v1">http://arxiv.org/abs/2507.21809v1</a></p>

            <p><strong>Abstract:</strong><br>
            Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again</title>
      <itunes:episode>1020</itunes:episode>
      <podcast:episode>1020</podcast:episode>
      <itunes:title>X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">751049fd-ac62-421d-9cc8-38ccd99c5dc6</guid>
      <link>https://share.transistor.fm/s/6c0754d1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22058v1">http://arxiv.org/abs/2507.22058v1</a></p>

            <p><strong>Abstract:</strong><br>
            Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22058v1">http://arxiv.org/abs/2507.22058v1</a></p>

            <p><strong>Abstract:</strong><br>
            Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Jul 2025 20:19:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c0754d1/97d82fae.mp3" length="17317420" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1079</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.22058v1">http://arxiv.org/abs/2507.22058v1</a></p>

            <p><strong>Abstract:</strong><br>
            Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge</title>
      <itunes:episode>1019</itunes:episode>
      <podcast:episode>1019</podcast:episode>
      <itunes:title>ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">10112126-01d3-4264-b777-1c3018b2404c</guid>
      <link>https://share.transistor.fm/s/18741468</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Zhao, Bo Chen, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, Huayang Wang, Zhongyang Dai, Liyang Wen, Xin Chen, Kai Yu</p>

            <p><strong>Title:</strong><br>
            ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21990v2">http://arxiv.org/abs/2507.21990v2</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. Then, we propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Zhao, Bo Chen, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, Huayang Wang, Zhongyang Dai, Liyang Wen, Xin Chen, Kai Yu</p>

            <p><strong>Title:</strong><br>
            ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21990v2">http://arxiv.org/abs/2507.21990v2</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. Then, we propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Jul 2025 20:18:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18741468/0ce2f65c.mp3" length="21807536" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Zhao, Bo Chen, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, Huayang Wang, Zhongyang Dai, Liyang Wen, Xin Chen, Kai Yu</p>

            <p><strong>Title:</strong><br>
            ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21990v2">http://arxiv.org/abs/2507.21990v2</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoner LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized knowledge points to enhance the model's understanding of the fundamental principles and logical structure of chemistry. Then, we propose a mix-sourced distillation strategy that integrates expert-curated knowledge with general-domain reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the reliability, transparency, and practical utility of the model in real-world human-AI collaboration scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agentic Reinforced Policy Optimization</title>
      <itunes:episode>1018</itunes:episode>
      <podcast:episode>1018</podcast:episode>
      <itunes:title>Agentic Reinforced Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b489557b-644d-48ad-a906-a5eaa446ea27</guid>
      <link>https://share.transistor.fm/s/e058d4a6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Reinforced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.19849v1">http://arxiv.org/abs/2507.19849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Reinforced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.19849v1">http://arxiv.org/abs/2507.19849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:52:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e058d4a6/22835a41.mp3" length="21346070" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            Agentic Reinforced Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.19849v1">http://arxiv.org/abs/2507.19849v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO's superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts</title>
      <itunes:episode>1017</itunes:episode>
      <podcast:episode>1017</podcast:episode>
      <itunes:title>ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">993c45e4-d417-4d37-964c-42f55d716b34</guid>
      <link>https://share.transistor.fm/s/8f2f21cd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20939v1">http://arxiv.org/abs/2507.20939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20939v1">http://arxiv.org/abs/2507.20939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:52:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f2f21cd/bf9a1a6e.mp3" length="21459372" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, Jinwen Luo, Weibo Gu, Zexuan Li, Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20939v1">http://arxiv.org/abs/2507.20939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</title>
      <itunes:episode>1016</itunes:episode>
      <podcast:episode>1016</podcast:episode>
      <itunes:title>A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">afc62911-2bd7-4842-bb05-6c49d26f3845</guid>
      <link>https://share.transistor.fm/s/8e12967a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21046v1">http://arxiv.org/abs/2507.21046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions -- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21046v1">http://arxiv.org/abs/2507.21046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions -- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:52:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e12967a/bc2a028f.mp3" length="21255409" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21046v1">http://arxiv.org/abs/2507.21046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions -- what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning</title>
      <itunes:episode>1015</itunes:episode>
      <podcast:episode>1015</podcast:episode>
      <itunes:title>Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">96e27350-7b67-4ede-bb6f-6ec644fe556e</guid>
      <link>https://share.transistor.fm/s/752eae56</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zedong Wang, Siyuan Li, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21049v1">http://arxiv.org/abs/2507.21049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zedong Wang, Siyuan Li, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21049v1">http://arxiv.org/abs/2507.21049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:51:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/752eae56/4eae52b7.mp3" length="19284749" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1202</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zedong Wang, Siyuan Li, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21049v1">http://arxiv.org/abs/2507.21049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment</title>
      <itunes:episode>1014</itunes:episode>
      <podcast:episode>1014</podcast:episode>
      <itunes:title>SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f38ebed9-882f-4373-9915-0ae33d72c758</guid>
      <link>https://share.transistor.fm/s/7936acbd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen</p>

            <p><strong>Title:</strong><br>
            SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20984v1">http://arxiv.org/abs/2507.20984v1</a></p>

            <p><strong>Abstract:</strong><br>
            While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen</p>

            <p><strong>Title:</strong><br>
            SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20984v1">http://arxiv.org/abs/2507.20984v1</a></p>

            <p><strong>Abstract:</strong><br>
            While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:51:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7936acbd/78385d7e.mp3" length="22410252" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen</p>

            <p><strong>Title:</strong><br>
            SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.20984v1">http://arxiv.org/abs/2507.20984v1</a></p>

            <p><strong>Abstract:</strong><br>
            While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reconstructing 4D Spatial Intelligence: A Survey</title>
      <itunes:episode>1013</itunes:episode>
      <podcast:episode>1013</podcast:episode>
      <itunes:title>Reconstructing 4D Spatial Intelligence: A Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b087bd58-68b9-4172-b9fe-b480fd03acfe</guid>
      <link>https://share.transistor.fm/s/724f0437</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowei Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Reconstructing 4D Spatial Intelligence: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21045v1">http://arxiv.org/abs/2507.21045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowei Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Reconstructing 4D Spatial Intelligence: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21045v1">http://arxiv.org/abs/2507.21045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Jul 2025 20:50:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/724f0437/92833807.mp3" length="20689467" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowei Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Reconstructing 4D Spatial Intelligence: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.21045v1">http://arxiv.org/abs/2507.21045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Deep Researcher with Test-Time Diffusion</title>
      <itunes:episode>1012</itunes:episode>
      <podcast:episode>1012</podcast:episode>
      <itunes:title>Deep Researcher with Test-Time Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23892682-6020-4a39-9ead-525ec39428a4</guid>
      <link>https://share.transistor.fm/s/4ac76704</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee</p>

            <p><strong>Title:</strong><br>
            Deep Researcher with Test-Time Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16075v1">http://arxiv.org/abs/2507.16075v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee</p>

            <p><strong>Title:</strong><br>
            Deep Researcher with Test-Time Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16075v1">http://arxiv.org/abs/2507.16075v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.</p>
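
            <p><strong>Code sketch:</strong><br>
            A high-level sketch of the draft-centric "denoising" loop the abstract describes: a preliminary draft guides retrieval, and each retrieval step revises the draft. The llm and search callables are hypothetical stand-ins for a text-generation call and a retrieval backend, not APIs from the paper, and the prompts are illustrative.</p>

            <pre><code>
def ttd_dr(question, llm, search, steps=4):
    # Preliminary draft: an updatable skeleton that guides the research direction.
    draft = llm(f"Write a rough, outline-style draft answering: {question}")
    for _ in range(steps):
        # The current draft determines what to look up next.
        query = llm(f"Given this draft, propose one search query that would most improve it:\n{draft}")
        evidence = search(query)
        # One denoising step: revise the draft with the retrieved evidence.
        draft = llm(f"Revise the draft using the evidence.\nDraft:\n{draft}\nEvidence:\n{evidence}")
    return draft
</code></pre>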
            ]]>
      </content:encoded>
      <pubDate>Mon, 28 Jul 2025 20:02:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4ac76704/a8e6efc9.mp3" length="22118043" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee</p>

            <p><strong>Title:</strong><br>
            Deep Researcher with Test-Time Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16075v1">http://arxiv.org/abs/2507.16075v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention</title>
      <itunes:episode>1011</itunes:episode>
      <podcast:episode>1011</podcast:episode>
      <itunes:title>$\nabla$NABLA: Neighborhood Adaptive Block-Level Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03664069-c0b1-496d-ace6-dc493baf9c6b</guid>
      <link>https://share.transistor.fm/s/2041b7bb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13546v1">http://arxiv.org/abs/2507.13546v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no drop in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13546v1">http://arxiv.org/abs/2507.13546v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no drop in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA</p>
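
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of building a block-level sparsity mask in the spirit of the abstract, assuming block-mean-pooled query/key summaries and a per-row quantile as the adaptive sparsity-driven threshold; the real method plugs such a mask into PyTorch's Flex Attention, which is omitted here. The block size and keep fraction are illustrative.</p>

            <pre><code>
import torch

def block_keep_mask(q, k, block=64, keep=0.25):
    """q, k: [seq, dim]. Returns a boolean [n_blocks, n_blocks] keep-mask."""
    n = q.shape[0] // block
    # Coarse queries/keys: mean-pool each block down to a single vector.
    qb = q[: n * block].reshape(n, block, -1).mean(dim=1)
    kb = k[: n * block].reshape(n, block, -1).mean(dim=1)
    scores = qb @ kb.T / qb.shape[-1] ** 0.5  # block-level attention scores
    # Adaptive threshold: keep roughly the top `keep` fraction of key blocks per query block.
    thresh = torch.quantile(scores, 1.0 - keep, dim=-1, keepdim=True)
    return scores >= thresh

mask = block_keep_mask(torch.randn(1024, 128), torch.randn(1024, 128))
print(mask.float().mean())  # approximately the kept-block fraction
</code></pre>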
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Jul 2025 20:24:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2041b7bb/9f2c6e58.mp3" length="20398577" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1271</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov</p>

            <p><strong>Title:</strong><br>
            $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13546v1">http://arxiv.org/abs/2507.13546v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch's Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no drop in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Group Sequence Policy Optimization</title>
      <itunes:episode>1010</itunes:episode>
      <podcast:episode>1010</podcast:episode>
      <itunes:title>Group Sequence Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a87e1814-93b7-4e33-9b02-a81049a47b83</guid>
      <link>https://share.transistor.fm/s/26e38441</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Group Sequence Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.18071v1">http://arxiv.org/abs/2507.18071v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Group Sequence Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.18071v1">http://arxiv.org/abs/2507.18071v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.</p>
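
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of the sequence-level objective described in the abstract: one importance ratio per response, computed from sequence likelihood and clipped at the sequence level. The length normalization of the ratio and the GRPO-style group-normalized advantage are assumptions for illustration, not the paper's exact formulation.</p>

            <pre><code>
import math

def gspo_loss(new_logps, old_logps, rewards, clip_eps=0.2):
    """new_logps/old_logps: per-response lists of token log-probs; rewards: one scalar per response."""
    # Group-normalized advantages over responses sampled for the same prompt.
    mean_r = sum(rewards) / len(rewards)
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    loss = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Sequence-level importance ratio from total sequence log-likelihood,
        # length-normalized so long responses do not blow up the ratio.
        ratio = math.exp((sum(new_lp) - sum(old_lp)) / len(new_lp))
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Sequence-level clipping: the whole response is kept or clipped as a unit.
        loss = loss - min(ratio * adv, clipped * adv)
    return loss / len(rewards)
</code></pre>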
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Jul 2025 20:23:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26e38441/8c750c3a.mp3" length="22425237" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1398</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Group Sequence Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.18071v1">http://arxiv.org/abs/2507.18071v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MUR: Momentum Uncertainty guided Reasoning for Large Language Models</title>
      <itunes:episode>1009</itunes:episode>
      <podcast:episode>1009</podcast:episode>
      <itunes:title>MUR: Momentum Uncertainty guided Reasoning for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05c89323-90ae-40df-b366-b5ac4ae8b4ef</guid>
      <link>https://share.transistor.fm/s/a3ca2f99</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MUR: Momentum Uncertainty guided Reasoning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14958v1">http://arxiv.org/abs/2507.14958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MUR: Momentum Uncertainty guided Reasoning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14958v1">http://arxiv.org/abs/2507.14958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.</p>
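
            <p><strong>Code sketch:</strong><br>
            A small sketch of the momentum-tracked uncertainty idea from the abstract, assuming a per-step scalar uncertainty (for example, mean token negative log-likelihood), an exponential moving average as the momentum tracker, and a trigger rule in which gamma scales the threshold. The trigger rule and defaults are illustrative, not the paper's exact procedure.</p>

            <pre><code>
def should_scale_step(step_uncertainty, momentum, beta=0.9, gamma=1.0):
    """Return (scale_this_step, updated_momentum)."""
    # Spend extra test-time compute only when the current step looks unusually
    # uncertain relative to the momentum-aggregated history; gamma tunes the budget.
    scale = step_uncertainty > gamma * momentum
    # Exponential moving average aggregates stepwise uncertainty over time.
    momentum = beta * momentum + (1.0 - beta) * step_uncertainty
    return scale, momentum

steps = [0.40, 0.35, 0.90, 0.30]   # toy per-step uncertainties
momentum = steps[0]                # warm-start the tracker
for u in steps:
    expand, momentum = should_scale_step(u, momentum)
    print(expand, round(momentum, 3))   # only the 0.90 spike triggers extra compute
</code></pre>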
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Jul 2025 20:23:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a3ca2f99/f1bc2531.mp3" length="21641180" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1349</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu</p>

            <p><strong>Title:</strong><br>
            MUR: Momentum Uncertainty guided Reasoning for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14958v1">http://arxiv.org/abs/2507.14958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</title>
      <itunes:episode>1008</itunes:episode>
      <podcast:episode>1008</podcast:episode>
      <itunes:title>LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23b46523-8b5c-4521-bb81-a04b47ed477b</guid>
      <link>https://share.transistor.fm/s/b245db45</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15758v1">http://arxiv.org/abs/2507.15758v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15758v1">http://arxiv.org/abs/2507.15758v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.</p>
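
            <p><strong>Code sketch:</strong><br>
            An illustrative sketch of the two stages described in the abstract: first summarize the distribution of successful solution lengths, then use that budget during training (the paper also embeds it directly in the reasoning context as guidance). The percentile summary and the soft length penalty below are assumptions for illustration, not the paper's exact reward.</p>

            <pre><code>
def length_budget(successful_lengths, percentile=0.75):
    """Stage 1: summarize the statistical distribution of successful solution lengths."""
    xs = sorted(successful_lengths)
    idx = min(len(xs) - 1, int(percentile * len(xs)))
    return xs[idx]

def length_adaptive_reward(correct, n_tokens, budget, alpha=0.5):
    """Stage 2: reward correctness, softly discounting responses that overshoot the budget."""
    if not correct:
        return 0.0
    overshoot = max(0, n_tokens - budget) / budget
    return 1.0 - alpha * min(1.0, overshoot)

budget = length_budget([220, 310, 180, 400, 260])         # token counts of correct rollouts
print(budget, length_adaptive_reward(True, 450, budget))  # 310, ~0.77
</code></pre>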
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Jul 2025 20:23:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b245db45/7d455640.mp3" length="19489538" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1214</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Wu, Yuchen Yan, Shangke Lyu, Linjuan Wu, Yiwen Qiu, Yongliang Shen, Weiming Lu, Jian Shao, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15758v1">http://arxiv.org/abs/2507.15758v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model's reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pixels, Patterns, but No Poetry: To See The World like Humans</title>
      <itunes:episode>1007</itunes:episode>
      <podcast:episode>1007</podcast:episode>
      <itunes:title>Pixels, Patterns, but No Poetry: To See The World like Humans</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">178b67d9-2219-449c-a670-494db3951ea0</guid>
      <link>https://share.transistor.fm/s/13e21cc6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Pixels, Patterns, but No Poetry: To See The World like Humans</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16863v1">http://arxiv.org/abs/2507.16863v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone (effective for previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision-tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Pixels, Patterns, but No Poetry: To See The World like Humans</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16863v1">http://arxiv.org/abs/2507.16863v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone (effective for previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision-tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Jul 2025 20:44:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13e21cc6/20209ed8.mp3" length="16642381" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1036</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Zihao Huang, Lin Xu, Jingyi Tang, Xinhao Li, Yue Liu, Haoyang Li, Taihang Hu, Minhua Lin, Xinlong Yang, Ge Wu, Balong Bi, Hongyu Chen, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            Pixels, Patterns, but No Poetry: To See The World like Humans</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16863v1">http://arxiv.org/abs/2507.16863v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks, which are trivial for humans. Both in-context learning and training on the language backbone (effective for previous benchmarks) fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision-tower generalization rather than for the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Yume: An Interactive World Generation Model</title>
      <itunes:episode>1006</itunes:episode>
      <podcast:episode>1006</podcast:episode>
      <itunes:title>Yume: An Interactive World Generation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e820334d-1511-42ab-af79-4edeb0a3557f</guid>
      <link>https://share.transistor.fm/s/245aca64</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume: An Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17744v1">http://arxiv.org/abs/2507.17744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume: An Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17744v1">http://arxiv.org/abs/2507.17744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Jul 2025 20:44:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/245aca64/9ab72ede.mp3" length="25197569" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1571</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Yume: An Interactive World Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17744v1">http://arxiv.org/abs/2507.17744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on https://github.com/stdstu12/YUME. Yume will update monthly to achieve its original goal. Project page: https://stdstu12.github.io/YUME-Project/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DesignLab: Designing Slides Through Iterative Detection and Correction</title>
      <itunes:episode>1005</itunes:episode>
      <podcast:episode>1005</podcast:episode>
      <itunes:title>DesignLab: Designing Slides Through Iterative Detection and Correction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f1359ec2-9ad3-4460-9c8c-21c05cf8b645</guid>
      <link>https://share.transistor.fm/s/a1c2aece</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu</p>

            <p><strong>Title:</strong><br>
            DesignLab: Designing Slides Through Iterative Detection and Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17202v1">http://arxiv.org/abs/2507.17202v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles: the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop in which the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration and reach a level of quality that would otherwise be unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of design, which results in polished, professional slides.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu</p>

            <p><strong>Title:</strong><br>
            DesignLab: Designing Slides Through Iterative Detection and Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17202v1">http://arxiv.org/abs/2507.17202v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles: the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop in which the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration and reach a level of quality that would otherwise be unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of design, which results in polished, professional slides.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Jul 2025 20:43:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a1c2aece/0bdadbd5.mp3" length="22265613" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1388</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jooyeol Yun, Heng Wang, Yotaro Shimose, Jaegul Choo, Shingo Takamatsu</p>

            <p><strong>Title:</strong><br>
            DesignLab: Designing Slides Through Iterative Detection and Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17202v1">http://arxiv.org/abs/2507.17202v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles: the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop where the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration, reaching qualities that would otherwise be unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of design, which can result in polished, professional slides.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning</title>
      <itunes:episode>1004</itunes:episode>
      <podcast:episode>1004</podcast:episode>
      <itunes:title>Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4f8a6414-7d7d-46aa-9c51-91856b911f94</guid>
      <link>https://share.transistor.fm/s/3ddd64e5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17512v1">http://arxiv.org/abs/2507.17512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real-world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions, including mutual enhancements and conflicts, that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17512v1">http://arxiv.org/abs/2507.17512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real-world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions, including mutual enhancements and conflicts, that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Jul 2025 20:43:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3ddd64e5/970d0041.mp3" length="18700870" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1165</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.17512v1">http://arxiv.org/abs/2507.17512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real-world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions, including mutual enhancements and conflicts, that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning</title>
      <itunes:episode>1003</itunes:episode>
      <podcast:episode>1003</podcast:episode>
      <itunes:title>Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">48a7ad70-68c7-43c9-8e1c-d67d2bdc51bb</guid>
      <link>https://share.transistor.fm/s/40b2c7fd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass</p>

            <p><strong>Title:</strong><br>
            Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16784v1">http://arxiv.org/abs/2507.16784v1</a></p>

            <p><strong>Abstract:</strong><br>
            To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass</p>

            <p><strong>Title:</strong><br>
            Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16784v1">http://arxiv.org/abs/2507.16784v1</a></p>

            <p><strong>Abstract:</strong><br>
            To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Jul 2025 20:41:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40b2c7fd/4bec8884.mp3" length="20221792" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 77 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass</p>

            <p><strong>Title:</strong><br>
            Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16784v1">http://arxiv.org/abs/2507.16784v1</a></p>

            <p><strong>Abstract:</strong><br>
            To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step-Audio 2 Technical Report</title>
      <itunes:episode>1002</itunes:episode>
      <podcast:episode>1002</podcast:episode>
      <itunes:title>Step-Audio 2 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0365ec23-9794-4a08-8f81-e1ed2d17d267</guid>
      <link>https://share.transistor.fm/s/b5875794</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            Step-Audio 2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16632v1">http://arxiv.org/abs/2507.16632v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            Step-Audio 2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16632v1">http://arxiv.org/abs/2507.16632v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Jul 2025 20:41:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b5875794/d4a7241d.mp3" length="22088357" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen, Yong Ren, Yuankai Ma, Yufan Lu, Bin Wang, Bo Li, Changxin Miao, Che Liu, Chen Xu, Dapeng Shi, Dingyuan Hu, Donghang Wu, Enle Liu, Guanzhe Huang, Gulin Yan, Han Zhang, Hao Nie, Haonan Jia, Hongyu Zhou, Jianjian Sun, Jiaoren Wu, Jie Wu, Jie Yang, Jin Yang, Junzhe Lin, Kaixiang Li, Lei Yang, Liying Shi, Li Zhou, Longlong Gu, Ming Li, Mingliang Li, Mingxiao Li, Nan Wu, Qi Han, Qinyuan Tan, Shaoliang Pang, Shengjie Fan, Siqi Liu, Tiancheng Cao, Wanying Lu, Wenqing He, Wuxun Xie, Xu Zhao, Xueqi Li, Yanbo Yu, Yang Yang, Yi Liu, Yifan Lu, Yilei Wang, Yuanhao Ding, Yuanwei Liang, Yuanwei Lu, Yuchu Luo, Yuhe Yin, Yumeng Zhan, Yuxiang Zhang, Zidong Yang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu</p>

            <p><strong>Title:</strong><br>
            Step-Audio 2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16632v1">http://arxiv.org/abs/2507.16632v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning</title>
      <itunes:episode>1001</itunes:episode>
      <podcast:episode>1001</podcast:episode>
      <itunes:title>MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f200fd8d-e417-446d-9e40-36d8c77e847e</guid>
      <link>https://share.transistor.fm/s/e03fb3c7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Run-Ze Fan, Zengzhi Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16812v1">http://arxiv.org/abs/2507.16812v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Run-Ze Fan, Zengzhi Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16812v1">http://arxiv.org/abs/2507.16812v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Jul 2025 20:41:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e03fb3c7/fad4598f.mp3" length="19096240" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1190</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Run-Ze Fan, Zengzhi Wang, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16812v1">http://arxiv.org/abs/2507.16812v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers</title>
      <itunes:episode>1000</itunes:episode>
      <podcast:episode>1000</podcast:episode>
      <itunes:title>Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27bab1a0-27e0-4c72-a7a4-f0a00f5b8be9</guid>
      <link>https://share.transistor.fm/s/8db5c323</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08422v1">http://arxiv.org/abs/2507.08422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) upsampling of all latents at full resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08422v1">http://arxiv.org/abs/2507.08422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) upsampling of all latents at full resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Jul 2025 20:40:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8db5c323/6527300f.mp3" length="18662410" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1163</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08422v1">http://arxiv.org/abs/2507.08422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have emerged as an alternative to U-net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution denoising latent diffusion to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) upsampling of all latents at full resolution for detail refinement. To stabilize generations across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and thus can be seamlessly integrated to further reduce inference latency without compromising generation quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning</title>
      <itunes:episode>999</itunes:episode>
      <podcast:episode>999</podcast:episode>
      <itunes:title>Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa6cde8e-dec1-4a0f-89dd-96113ec9a27d</guid>
      <link>https://share.transistor.fm/s/df782aaf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum</p>

            <p><strong>Title:</strong><br>
            Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16746v1">http://arxiv.org/abs/2507.16746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum</p>

            <p><strong>Title:</strong><br>
            Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16746v1">http://arxiv.org/abs/2507.16746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Jul 2025 20:40:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/df782aaf/1ec49150.mp3" length="18216833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1135</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum</p>

            <p><strong>Title:</strong><br>
            Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.16746v1">http://arxiv.org/abs/2507.16746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce $\textbf{Zebra-CoT}$, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding</title>
      <itunes:episode>998</itunes:episode>
      <podcast:episode>998</podcast:episode>
      <itunes:title>GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">13b6cd6d-471b-4f5b-a711-b1c82b485261</guid>
      <link>https://share.transistor.fm/s/55ba1801</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15846v2">http://arxiv.org/abs/2507.15846v2</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15846v2">http://arxiv.org/abs/2507.15846v2</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:40:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/55ba1801/3d5871b3.mp3" length="23661578" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1475</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang</p>

            <p><strong>Title:</strong><br>
            GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15846v2">http://arxiv.org/abs/2507.15846v2</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization</title>
      <itunes:episode>997</itunes:episode>
      <podcast:episode>997</podcast:episode>
      <itunes:title>MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">29144f32-5cfe-4179-af77-a75ee975d1b5</guid>
      <link>https://share.transistor.fm/s/affc176b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14683v1">http://arxiv.org/abs/2507.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14683v1">http://arxiv.org/abs/2507.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.</p>
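
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract names the two ingredients of Context-Aware Multi-Stage Policy Optimization, length-progressive training and an adaptive repetition penalty, without giving formulas. A generic sketch of how such reward shaping could be wired up; the n-gram counting, the weights, and the hard per-stage length budget are all assumptions:</p>

            <pre><code>from collections import Counter

def repetition_penalty(token_ids, n=4, weight=0.1):
    # Generic n-gram repetition penalty, an assumed stand-in for the adaptive
    # penalty described in the abstract: the more a response repeats its own
    # n-grams, the more its scalar reward is reduced.
    if n > len(token_ids):
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return weight * repeated / len(ngrams)

def shaped_reward(verifier_score, token_ids, stage_length_budget):
    # Length-progressive flavor (assumption): responses exceeding the current
    # training stage's length budget receive no reward at that stage.
    if len(token_ids) > stage_length_budget:
        return 0.0
    return verifier_score - repetition_penalty(token_ids)
</code></pre>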
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:40:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/affc176b/e9bfa7cc.mp3" length="22314976" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 93 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14683v1">http://arxiv.org/abs/2507.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Invisible Leash: Why RLVR May Not Escape Its Origin</title>
      <itunes:episode>996</itunes:episode>
      <podcast:episode>996</podcast:episode>
      <itunes:title>The Invisible Leash: Why RLVR May Not Escape Its Origin</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">931684bd-bd18-4060-bb04-8c72195f0319</guid>
      <link>https://share.transistor.fm/s/b7615886</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            The Invisible Leash: Why RLVR May Not Escape Its Origin</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14843v1">http://arxiv.org/abs/2507.14843v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support, unable to sample solutions with zero initial probability, and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            The Invisible Leash: Why RLVR May Not Escape Its Origin</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14843v1">http://arxiv.org/abs/2507.14843v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support, unable to sample solutions with zero initial probability, and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.</p>
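
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The support constraint is easy to see in a toy example: exponentiated-reward reweighting of a base distribution (a standard illustration, not the paper's derivation) can concentrate probability mass but never creates mass where the base model assigns exactly zero:</p>

            <pre><code>import numpy as np

# Toy illustration of the "conservative reweighting" view: reweighting a base
# distribution by exponentiated rewards can sharpen mass that already exists,
# but an answer with zero probability under the base model stays at exactly zero.
base_probs = np.array([0.60, 0.30, 0.10, 0.00])   # last answer is outside the base support
rewards    = np.array([0.0, 1.0, 1.0, 1.0])       # only the last three answers are "correct"
beta = 1.0

weights = base_probs * np.exp(rewards / beta)
rl_probs = weights / weights.sum()
print(rl_probs)   # mass concentrates on supported correct answers; the final entry remains 0.0
</code></pre>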
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:39:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b7615886/4bffc5ed.mp3" length="23236934" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1449</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, Yejin Choi</p>

            <p><strong>Title:</strong><br>
            The Invisible Leash: Why RLVR May Not Escape Its Origin</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14843v1">http://arxiv.org/abs/2507.14843v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model's support, unable to sample solutions with zero initial probability, and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining</title>
      <itunes:episode>995</itunes:episode>
      <podcast:episode>995</podcast:episode>
      <itunes:title>NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f494198-898f-4594-b0f5-227cef26edc5</guid>
      <link>https://share.transistor.fm/s/38a2bd0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev</p>

            <p><strong>Title:</strong><br>
            NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14119v1">http://arxiv.org/abs/2507.14119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev</p>

            <p><strong>Title:</strong><br>
            NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14119v1">http://arxiv.org/abs/2507.14119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:39:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38a2bd0d/09e6c6b8.mp3" length="22492564" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1402</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev</p>

            <p><strong>Title:</strong><br>
            NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.14119v1">http://arxiv.org/abs/2507.14119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization</title>
      <itunes:episode>994</itunes:episode>
      <podcast:episode>994</podcast:episode>
      <itunes:title>WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c78ca21-a69c-41dc-95ca-dc7c4a9b78f7</guid>
      <link>https://share.transistor.fm/s/5f08ecc0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15061v1">http://arxiv.org/abs/2507.15061v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistencies between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework for constructing datasets. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over the reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex using retrieval and validation tools, guided by our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15061v1">http://arxiv.org/abs/2507.15061v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistencies between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework for constructing datasets. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over the reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex using retrieval and validation tools, guided by our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.</p>
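
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            One hypothetical way to read Knowledge Projections is as mappings from an entity to a set of related entities, with harder questions built by composing and intersecting projections. The names and operations below are assumptions for illustration only:</p>

            <pre><code>from typing import Callable, Set

# Hypothetical reading of a "Knowledge Projection": a mapping from an entity to
# the set of entities related to it under one relation.
KP = Callable[[str], Set[str]]

def compose(outer: KP, inner: KP) -> KP:
    # Answer set of outer applied to everything inner yields (union over intermediates).
    def kp(entity: str) -> Set[str]:
        result: Set[str] = set()
        for mid in inner(entity):
            result |= outer(mid)
        return result
    return kp

def intersect(a: KP, b: KP) -> KP:
    # Entities that satisfy both projections at once.
    return lambda entity: a(entity) & b(entity)

# A seed question "who directed X" can then be expanded step by step, e.g.
# compose(films_of, directed_by) asks for all films by the director of X.
</code></pre>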
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:39:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f08ecc0/9518fa66.mp3" length="19859429" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1238</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15061v1">http://arxiv.org/abs/2507.15061v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistencies between the information structure and the reasoning structure, and between the question and the answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework for constructing datasets. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over the reasoning structure through compositions of KP operations. During synthesis, we begin by creating seed tasks and then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex using retrieval and validation tools, guided by our formalization. We train our model on the synthesized dataset. Experimental results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on the GAIA and WebWalkerQA benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GR-3 Technical Report</title>
      <itunes:episode>993</itunes:episode>
      <podcast:episode>993</podcast:episode>
      <itunes:title>GR-3 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e7b96a05-094c-491f-aec5-93aba6e918ee</guid>
      <link>https://share.transistor.fm/s/97ac0e6c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang</p>

            <p><strong>Title:</strong><br>
            GR-3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15493v2">http://arxiv.org/abs/2507.15493v2</a></p>

            <p><strong>Abstract:</strong><br>
            We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang</p>

            <p><strong>Title:</strong><br>
            GR-3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15493v2">http://arxiv.org/abs/2507.15493v2</a></p>

            <p><strong>Abstract:</strong><br>
            We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:38:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/97ac0e6c/424411e7.mp3" length="25008211" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1559</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang</p>

            <p><strong>Title:</strong><br>
            GR-3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15493v2">http://arxiv.org/abs/2507.15493v2</a></p>

            <p><strong>Abstract:</strong><br>
            We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling</title>
      <itunes:episode>992</itunes:episode>
      <podcast:episode>992</podcast:episode>
      <itunes:title>Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">44ff027b-8dca-4ff9-998e-b67c79cb18fc</guid>
      <link>https://share.transistor.fm/s/dcc8809c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hayeon Kim, Ji Ha Jang, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11061v2">http://arxiv.org/abs/2507.11061v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and the inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label properties, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, enabling more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hayeon Kim, Ji Ha Jang, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11061v2">http://arxiv.org/abs/2507.11061v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and the inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label properties, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, enabling more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.</p>
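
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Read as a loss recipe, the regularized SDS objective might combine its terms roughly as follows; the masked L1 form, the extra prior penalty, and the weights are illustrative assumptions rather than the paper's exact definitions:</p>

            <pre><code>import torch

def regularized_edit_loss(sds_loss, rendered, slamp_target, part_mask,
                          prior_penalty, w_anchor=1.0, w_prior=0.1):
    # Sketch of the kind of combination the abstract describes: the standard SDS
    # loss, plus an L1 anchor toward a part-edited reference image restricted to
    # the masked target region, plus an additional regularizer. All weights and
    # the exact anchor form are assumptions.
    l1_anchor = (part_mask * (rendered - slamp_target).abs()).mean()
    return sds_loss + w_anchor * l1_anchor + w_prior * prior_penalty
</code></pre>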
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:38:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dcc8809c/7030f15e.mp3" length="21645395" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1349</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hayeon Kim, Ji Ha Jang, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11061v2">http://arxiv.org/abs/2507.11061v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and the inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label properties, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, enabling more robust and flexible part-level 3D Gaussian editing. Code is available at https://janeyeon.github.io/romap.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction</title>
      <itunes:episode>991</itunes:episode>
      <podcast:episode>991</podcast:episode>
      <itunes:title>SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9424a84d-29cd-4522-a9af-4a7a17aa4106</guid>
      <link>https://share.transistor.fm/s/d8fa82b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15852v2">http://arxiv.org/abs/2507.15852v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15852v2">http://arxiv.org/abs/2507.15852v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:38:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8fa82b8/7df50a64.mp3" length="20856268" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Songxin He, Jianfan Lin, Junsong Tang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15852v2">http://arxiv.org/abs/2507.15852v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos</title>
      <itunes:episode>990</itunes:episode>
      <podcast:episode>990</podcast:episode>
      <itunes:title>Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">609282af-4866-4719-9e03-61010212b881</guid>
      <link>https://share.transistor.fm/s/1135fee6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu</p>

            <p><strong>Title:</strong><br>
            Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15597v1">http://arxiv.org/abs/2507.15597v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu</p>

            <p><strong>Title:</strong><br>
            Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15597v1">http://arxiv.org/abs/2507.15597v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Being-H0, a dexterous Vision-Language-Action (VLA) model trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources, including motion capture, VR, and RGB-only videos, into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Jul 2025 21:37:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1135fee6/5b0fd31a.mp3" length="23174677" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, Zongqing Lu</p>

            <p><strong>Title:</strong><br>
            Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.15597v1">http://arxiv.org/abs/2507.15597v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Being-H0, a dexterous Vision-Language-Action (VLA) model trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources, including motion capture, VR, and RGB-only videos, into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data size. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</title>
      <itunes:episode>989</itunes:episode>
      <podcast:episode>989</podcast:episode>
      <itunes:title>The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ea4fa4c5-67d5-40b2-a22d-5e31a310460d</guid>
      <link>https://share.transistor.fm/s/975f064b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11097v1">http://arxiv.org/abs/2507.11097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model's dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11097v1">http://arxiv.org/abs/2507.11097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model's dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 21 Jul 2025 20:06:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/975f064b/53584234.mp3" length="19171467" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1195</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11097v1">http://arxiv.org/abs/2507.11097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model's dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models</title>
      <itunes:episode>988</itunes:episode>
      <podcast:episode>988</podcast:episode>
      <itunes:title>A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">40b29613-9ffa-4caf-a62f-3189167b2fad</guid>
      <link>https://share.transistor.fm/s/ebc4f4f7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian</p>

            <p><strong>Title:</strong><br>
            A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13563v1">http://arxiv.org/abs/2507.13563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian</p>

            <p><strong>Title:</strong><br>
            A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13563v1">http://arxiv.org/abs/2507.13563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 21 Jul 2025 20:06:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ebc4f4f7/857d2590.mp3" length="18977147" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1182</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian</p>

            <p><strong>Title:</strong><br>
            A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13563v1">http://arxiv.org/abs/2507.13563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Context Engineering for Large Language Models</title>
      <itunes:episode>987</itunes:episode>
      <podcast:episode>987</podcast:episode>
      <itunes:title>A Survey of Context Engineering for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b7807c41-741b-42b5-a0e7-653fbe94cd5b</guid>
      <link>https://share.transistor.fm/s/eff53dc2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Context Engineering for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13334v1">http://arxiv.org/abs/2507.13334v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing, and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Context Engineering for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13334v1">http://arxiv.org/abs/2507.13334v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing, and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:56:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eff53dc2/04583211.mp3" length="25763081" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1607</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Context Engineering for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13334v1">http://arxiv.org/abs/2507.13334v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing, and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning</title>
      <itunes:episode>986</itunes:episode>
      <podcast:episode>986</podcast:episode>
      <itunes:title>VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5a8cc50b-e029-437d-b03d-1822346c0dd0</guid>
      <link>https://share.transistor.fm/s/a0b13e2d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13348v1">http://arxiv.org/abs/2507.13348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which often greatly outnumber text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for solving the problem; if not, the model outputs a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding on OCR-related tasks while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image-resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13348v1">http://arxiv.org/abs/2507.13348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which often greatly outnumber text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for solving the problem; if not, the model outputs a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding on OCR-related tasks while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image-resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:56:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a0b13e2d/1e9ad997.mp3" length="21688421" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13348v1">http://arxiv.org/abs/2507.13348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which often greatly outnumber text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for solving the problem; if not, the model outputs a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding on OCR-related tasks while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image-resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning</title>
      <itunes:episode>985</itunes:episode>
      <podcast:episode>985</podcast:episode>
      <itunes:title>$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23b1e48c-7a98-47aa-9571-f5488f1d36ab</guid>
      <link>https://share.transistor.fm/s/634419f0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He</p>

            <p><strong>Title:</strong><br>
            $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13347v1">http://arxiv.org/abs/2507.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He</p>

            <p><strong>Title:</strong><br>
            $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13347v1">http://arxiv.org/abs/2507.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:56:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/634419f0/d2109c87.mp3" length="19664294" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1225</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He</p>

            <p><strong>Title:</strong><br>
            $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13347v1">http://arxiv.org/abs/2507.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner</title>
      <itunes:episode>984</itunes:episode>
      <podcast:episode>984</podcast:episode>
      <itunes:title>The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e3043fa7-b939-41fb-93cb-5562434eafe3</guid>
      <link>https://share.transistor.fm/s/3c7a3290</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen</p>

            <p><strong>Title:</strong><br>
            The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13332v1">http://arxiv.org/abs/2507.13332v1</a></p>

            <p><strong>Abstract:</strong><br>
            Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and it adopts an explicit memory-fetch mechanism to reduce the difficulty of dynamic, long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than particular thinking styles, are indispensable to TAIL's length generalization; with TAIL, the model exhibits read-and-write behaviors in its attention layers that are consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen</p>

            <p><strong>Title:</strong><br>
            The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13332v1">http://arxiv.org/abs/2507.13332v1</a></p>

            <p><strong>Abstract:</strong><br>
            Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and it adopts an explicit memory-fetch mechanism to reduce the difficulty of dynamic, long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than particular thinking styles, are indispensable to TAIL's length generalization; with TAIL, the model exhibits read-and-write behaviors in its attention layers that are consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:55:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3c7a3290/23e16c7b.mp3" length="22740001" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1418</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen</p>

            <p><strong>Title:</strong><br>
            The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13332v1">http://arxiv.org/abs/2507.13332v1</a></p>

            <p><strong>Abstract:</strong><br>
            Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and it adopts an explicit memory-fetch mechanism to reduce the difficulty of dynamic, long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than particular thinking styles, are indispensable to TAIL's length generalization; with TAIL, the model exhibits read-and-write behaviors in its attention layers that are consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning</title>
      <itunes:episode>983</itunes:episode>
      <podcast:episode>983</podcast:episode>
      <itunes:title>AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">917276bb-0cfb-498c-8d60-d056bb8805f6</guid>
      <link>https://share.transistor.fm/s/23351f12</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu</p>

            <p><strong>Title:</strong><br>
            AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12841v1">http://arxiv.org/abs/2507.12841v1</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu</p>

            <p><strong>Title:</strong><br>
            AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12841v1">http://arxiv.org/abs/2507.12841v1</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:55:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/23351f12/bc154302.mp3" length="22026985" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1373</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu</p>

            <p><strong>Title:</strong><br>
            AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12841v1">http://arxiv.org/abs/2507.12841v1</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models</title>
      <itunes:episode>982</itunes:episode>
      <podcast:episode>982</podcast:episode>
      <itunes:title>Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">09a7d9e3-e8d2-4fdb-9128-707cc2db9baf</guid>
      <link>https://share.transistor.fm/s/d6081f68</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou</p>

            <p><strong>Title:</strong><br>
            Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13344v1">http://arxiv.org/abs/2507.13344v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through this iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while keeping GPU memory consumption affordable. Experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method synthesizes high-quality and consistent novel-view videos and significantly outperforms existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou</p>

            <p><strong>Title:</strong><br>
            Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13344v1">http://arxiv.org/abs/2507.13344v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:55:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6081f68/0bad1996.mp3" length="22549026" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1406</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou</p>

            <p><strong>Title:</strong><br>
            Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.13344v1">http://arxiv.org/abs/2507.13344v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lack spatio-temporal consistency, thus degrading view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information flows sufficiently across the latent grid, allowing the diffusion model to obtain a large receptive field and thus enhance the 4D consistency of the output, while making the GPU memory consumption affordable. The experiments on the DNA-Rendering and ActorsHQ datasets demonstrate that our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches. See our project page for interactive demos and video results: https://diffuman4d.github.io/ .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization</title>
      <itunes:episode>981</itunes:episode>
      <podcast:episode>981</podcast:episode>
      <itunes:title>RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0d9573b5-a827-4b5b-b619-17d68cee5c2e</guid>
      <link>https://share.transistor.fm/s/9abadc80</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50</p>

            <p><strong>Authors:</strong><br>
            Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba</p>

            <p><strong>Title:</strong><br>
            RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12142v1">http://arxiv.org/abs/2507.12142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of these challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain a numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50</p>

            <p><strong>Authors:</strong><br>
            Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba</p>

            <p><strong>Title:</strong><br>
            RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12142v1">http://arxiv.org/abs/2507.12142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of these challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain a numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Jul 2025 20:54:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9abadc80/7a067336.mp3" length="22134383" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50</p>

            <p><strong>Authors:</strong><br>
            Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba</p>

            <p><strong>Title:</strong><br>
            RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.12142v1">http://arxiv.org/abs/2507.12142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of these challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain a numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs</title>
      <itunes:episode>980</itunes:episode>
      <podcast:episode>980</podcast:episode>
      <itunes:title>Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d0eb523a-b657-4eb8-a9af-f103f315d6be</guid>
      <link>https://share.transistor.fm/s/d461d977</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu</p>

            <p><strong>Title:</strong><br>
            Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09477v2">http://arxiv.org/abs/2507.09477v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu</p>

            <p><strong>Title:</strong><br>
            Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09477v2">http://arxiv.org/abs/2507.09477v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 17 Jul 2025 19:55:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d461d977/fd1bed27.mp3" length="19788380" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1233</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu</p>

            <p><strong>Title:</strong><br>
            Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09477v2">http://arxiv.org/abs/2507.09477v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how retrieved knowledge of different types supplies missing premises and expands context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models</title>
      <itunes:episode>979</itunes:episode>
      <podcast:episode>979</podcast:episode>
      <itunes:title>Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c0262548-afd1-457e-ae14-608caec25811</guid>
      <link>https://share.transistor.fm/s/ab9214ac</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao</p>

            <p><strong>Title:</strong><br>
            Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07104v2">http://arxiv.org/abs/2507.07104v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao</p>

            <p><strong>Title:</strong><br>
            Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07104v2">http://arxiv.org/abs/2507.07104v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Jul 2025 20:05:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ab9214ac/1203b3bf.mp3" length="19319855" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1204</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao</p>

            <p><strong>Title:</strong><br>
            Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07104v2">http://arxiv.org/abs/2507.07104v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes</title>
      <itunes:episode>978</itunes:episode>
      <podcast:episode>978</podcast:episode>
      <itunes:title>EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e615974a-fc8c-42ab-9124-ca3beab88655</guid>
      <link>https://share.transistor.fm/s/a9c71f3d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, :, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11407v1">http://arxiv.org/abs/2507.11407v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, :, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11407v1">http://arxiv.org/abs/2507.11407v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Jul 2025 20:04:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9c71f3d/28edbfa1.mp3" length="18733456" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1167</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, :, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.11407v1">http://arxiv.org/abs/2507.11407v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination</title>
      <itunes:episode>977</itunes:episode>
      <podcast:episode>977</podcast:episode>
      <itunes:title>Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">443358c2-f62e-4e46-8a5f-f76d1e3c3c04</guid>
      <link>https://share.transistor.fm/s/bfc78a47</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10532v1">http://arxiv.org/abs/2507.10532v1</a></p>

            <p><strong>Abstract:</strong><br>
            The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10532v1">http://arxiv.org/abs/2507.10532v1</a></p>

            <p><strong>Abstract:</strong><br>
            The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Jul 2025 20:43:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bfc78a47/25be2c76.mp3" length="20394435" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1271</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10532v1">http://arxiv.org/abs/2507.10532v1</a></p>

            <p><strong>Abstract:</strong><br>
            The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation</title>
      <itunes:episode>976</itunes:episode>
      <podcast:episode>976</podcast:episode>
      <itunes:title>SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">72e6b1f3-3c12-4fec-b8db-ac75c74ec541</guid>
      <link>https://share.transistor.fm/s/2e812cb5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li</p>

            <p><strong>Title:</strong><br>
            SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09862v1">http://arxiv.org/abs/2507.09862v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li</p>

            <p><strong>Title:</strong><br>
            SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09862v1">http://arxiv.org/abs/2507.09862v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Jul 2025 20:42:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e812cb5/01607c90.mp3" length="19664684" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1225</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li</p>

            <p><strong>Title:</strong><br>
            SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.09862v1">http://arxiv.org/abs/2507.09862v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation</title>
      <itunes:episode>975</itunes:episode>
      <podcast:episode>975</podcast:episode>
      <itunes:title>Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">082e831f-24ad-4671-a0d5-997a7df76dc6</guid>
      <link>https://share.transistor.fm/s/6f88165c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10524v1">http://arxiv.org/abs/2507.10524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10524v1">http://arxiv.org/abs/2507.10524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Jul 2025 20:42:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6f88165c/7441f1c1.mp3" length="21102872" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1315</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10524v1">http://arxiv.org/abs/2507.10524v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EmbRACE-3K: Embodied Reasoning and Action in Complex Environments</title>
      <itunes:episode>974</itunes:episode>
      <podcast:episode>974</podcast:episode>
      <itunes:title>EmbRACE-3K: Embodied Reasoning and Action in Complex Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa2919e1-6f24-4305-9a7b-814f9bbc1465</guid>
      <link>https://share.transistor.fm/s/687a5b97</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            EmbRACE-3K: Embodied Reasoning and Action in Complex Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10548v1">http://arxiv.org/abs/2507.10548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            EmbRACE-3K: Embodied Reasoning and Action in Complex Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10548v1">http://arxiv.org/abs/2507.10548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Jul 2025 20:42:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/687a5b97/d98b7f41.mp3" length="21417985" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            EmbRACE-3K: Embodied Reasoning and Action in Complex Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10548v1">http://arxiv.org/abs/2507.10548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once</title>
      <itunes:episode>973</itunes:episode>
      <podcast:episode>973</podcast:episode>
      <itunes:title>REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">88385391-6aca-4b23-920c-e584f862ac0e</guid>
      <link>https://share.transistor.fm/s/21dfc521</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10541v2">http://arxiv.org/abs/2507.10541v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.</p>
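
            <p><strong>Illustrative sketch:</strong><br>
            A minimal, hypothetical Python sketch of the simultaneous-testing idea: several benchmark questions are packed into a single prompt and each answer is graded separately. The prompt wording and exact-match scoring are assumptions for illustration, not the REST implementation.</p>

            <pre><code>def build_stress_prompt(problems):
    lines = ["Solve all of the following problems. Give a final answer for each one."]
    for i, problem in enumerate(problems, start=1):
        lines.append(f"Problem {i}: {problem}")
    return "\n".join(lines)

def per_problem_accuracy(answers, references):
    # answers: the model's final answer for each problem; references: gold answers.
    correct = sum(1 for a, r in zip(answers, references) if a.strip() == r.strip())
    return correct / len(references)

prompt = build_stress_prompt(["Compute 17 * 24.", "What is the derivative of x**3?"])
print(prompt)
print(per_problem_accuracy(["408", "3*x**2"], ["408", "3*x**2"]))  # 1.0
</code></pre>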
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10541v2">http://arxiv.org/abs/2507.10541v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Jul 2025 20:41:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/21dfc521/26eeb2f2.mp3" length="24446532" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1524</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.10541v2">http://arxiv.org/abs/2507.10541v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and diminishing difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly creation of new questions with substantial human effort, and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Test-Time Scaling with Reflective Generative Model</title>
      <itunes:episode>972</itunes:episode>
      <podcast:episode>972</podcast:episode>
      <itunes:title>Test-Time Scaling with Reflective Generative Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">82e3c39a-8185-409c-a525-116e92758bab</guid>
      <link>https://share.transistor.fm/s/c9c4bf5c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie</p>

            <p><strong>Title:</strong><br>
            Test-Time Scaling with Reflective Generative Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01951v2">http://arxiv.org/abs/2507.01951v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.</p>
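
            <p><strong>Illustrative sketch:</strong><br>
            A hypothetical Python sketch of the shared-backbone idea described above: one backbone feeds both a language-modeling head (to generate reasoning trajectories) and a small scoring head (to rank them at test time). Layer sizes and the pooling used for scoring are assumptions for illustration, not the MetaStone-S1 architecture.</p>

            <pre><code>import torch
import torch.nn as nn

class ReflectiveLM(nn.Module):
    def __init__(self, vocab=1000, d_model=64, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)  # shared backbone
        self.lm_head = nn.Linear(d_model, vocab)   # policy head: next-token prediction
        self.score_head = nn.Linear(d_model, 1)    # reward head: scores a whole trajectory

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))
        logits = self.lm_head(h)                   # used to sample candidate trajectories
        score = self.score_head(h.mean(dim=1))     # used to select the best trajectory
        return logits, score

candidates = torch.randint(0, 1000, (4, 16))       # 4 candidate reasoning trajectories
logits, scores = ReflectiveLM()(candidates)
best = scores.squeeze(-1).argmax().item()          # test-time scaling: keep the top-scored one
print(best)
</code></pre>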
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie</p>

            <p><strong>Title:</strong><br>
            Test-Time Scaling with Reflective Generative Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01951v2">http://arxiv.org/abs/2507.01951v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:57:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c9c4bf5c/b7067bbc.mp3" length="20751744" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1293</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie</p>

            <p><strong>Title:</strong><br>
            Test-Time Scaling with Reflective Generative Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01951v2">http://arxiv.org/abs/2507.01951v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first reflective generative model, MetaStone-S1, which matches OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for the policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning</title>
      <itunes:episode>971</itunes:episode>
      <podcast:episode>971</podcast:episode>
      <itunes:title>Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3d68a6c-cc63-42eb-88b2-ddf561cf9de2</guid>
      <link>https://share.transistor.fm/s/1a139539</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel</p>

            <p><strong>Title:</strong><br>
            Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05255v1">http://arxiv.org/abs/2507.05255v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel</p>

            <p><strong>Title:</strong><br>
            Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05255v1">http://arxiv.org/abs/2507.05255v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:57:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1a139539/e90af95e.mp3" length="20445414" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1274</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel</p>

            <p><strong>Title:</strong><br>
            Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05255v1">http://arxiv.org/abs/2507.05255v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NeuralOS: Towards Simulating Operating Systems via Neural Generative Models</title>
      <itunes:episode>970</itunes:episode>
      <podcast:episode>970</podcast:episode>
      <itunes:title>NeuralOS: Towards Simulating Operating Systems via Neural Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f03700d1-ae39-481d-8c08-fc0f18b3ba90</guid>
      <link>https://share.transistor.fm/s/de3cd21f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            NeuralOS: Towards Simulating Operating Systems via Neural Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08800v1">http://arxiv.org/abs/2507.08800v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.</p>
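
            <p><strong>Illustrative sketch:</strong><br>
            A heavily simplified, hypothetical Python sketch of the architecture described above: a recurrent network tracks computer state from a stream of user-input events, and a renderer predicts the next screen frame. The real system uses a diffusion-based renderer; here it is reduced to a tiny deconvolutional decoder, and all sizes are assumptions for illustration.</p>

            <pre><code>import torch
import torch.nn as nn

class TinyNeuralOS(nn.Module):
    def __init__(self, event_dim=8, state_dim=128):
        super().__init__()
        # RNN that tracks computer state from mouse/keyboard event features.
        self.state_tracker = nn.GRU(event_dim, state_dim, batch_first=True)
        # Stand-in renderer mapping the state to a 3 x 32 x 32 frame.
        self.renderer = nn.Sequential(
            nn.Linear(state_dim, 64 * 4 * 4), nn.ReLU(),
            nn.Unflatten(1, (64, 4, 4)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, events):
        # events: (batch, time, event_dim) sequence of user-input features
        _, last_state = self.state_tracker(events)
        return self.renderer(last_state.squeeze(0))   # predicted next frame

events = torch.randn(2, 10, 8)
print(TinyNeuralOS()(events).shape)  # torch.Size([2, 3, 32, 32])
</code></pre>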
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            NeuralOS: Towards Simulating Operating Systems via Neural Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08800v1">http://arxiv.org/abs/2507.08800v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:57:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de3cd21f/ca7f7a7b.mp3" length="20286580" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng</p>

            <p><strong>Title:</strong><br>
            NeuralOS: Towards Simulating Operating Systems via Neural Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08800v1">http://arxiv.org/abs/2507.08800v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering</title>
      <itunes:episode>969</itunes:episode>
      <podcast:episode>969</podcast:episode>
      <itunes:title>CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43faffd1-a1f8-462c-aa81-a83e3388a7a8</guid>
      <link>https://share.transistor.fm/s/07bfda23</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa</p>

            <p><strong>Title:</strong><br>
            CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08776v2">http://arxiv.org/abs/2507.08776v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering with compressed tokens, while remaining capable of changing the number of tokens used to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.</p>
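
            <p><strong>Illustrative sketch:</strong><br>
            A small, hypothetical Python sketch of the compute-budget idea: many per-ray tokens are clustered and only K centroid tokens are kept, where K is chosen at render time. scikit-learn's KMeans stands in for the paper's latent-space clustering, the condenser step is omitted, and shapes are assumptions for illustration.</p>

            <pre><code>import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ray_tokens = rng.normal(size=(4096, 32))    # tokens from a multi-view encoder (assumed shape)

def compress_to_clifts(tokens, budget):
    km = KMeans(n_clusters=budget, n_init=4, random_state=0).fit(tokens)
    # A real condenser would pool information from every token into its centroid token;
    # here the centroid itself stands in for the compressed token.
    return km.cluster_centers_

for budget in (64, 256, 1024):              # fewer tokens means cheaper rendering
    clifts = compress_to_clifts(ray_tokens, budget)
    print(budget, clifts.shape)
</code></pre>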
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa</p>

            <p><strong>Title:</strong><br>
            CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08776v2">http://arxiv.org/abs/2507.08776v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering with compressed tokens, while remaining capable of changing the number of tokens used to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:56:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07bfda23/dec2ae29.mp3" length="19975214" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1245</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa</p>

            <p><strong>Title:</strong><br>
            CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08776v2">http://arxiv.org/abs/2507.08776v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering with compressed tokens, while remaining capable of changing the number of tokens used to represent a scene or render a novel view with one trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with the camera poses. Latent-space K-means selects a reduced set of rays as cluster centroids using the tokens. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, achieving significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>KV Cache Steering for Inducing Reasoning in Small Language Models</title>
      <itunes:episode>968</itunes:episode>
      <podcast:episode>968</podcast:episode>
      <itunes:title>KV Cache Steering for Inducing Reasoning in Small Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">86ae48d9-6980-479a-b2ec-76f246427896</guid>
      <link>https://share.transistor.fm/s/c2814a7f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            KV Cache Steering for Inducing Reasoning in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08799v1">http://arxiv.org/abs/2507.08799v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.</p>
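
            <p><strong>Illustrative sketch:</strong><br>
            A minimal, hypothetical Python sketch of the one-shot intervention: a steering vector is estimated as the mean difference between cached states from reasoning-style prompts and plain prompts, then added once to the current prompt's cache before decoding. Plain tensors stand in for a real model's key-value cache, and the scale is an assumption, not the paper's recipe.</p>

            <pre><code>import torch

def steering_vector(cache_with_reasoning, cache_plain, scale=0.5):
    # Mean difference of cached states between trace-style and plain prompts.
    return scale * (cache_with_reasoning.mean(dim=0) - cache_plain.mean(dim=0))

def steer_cache(prompt_cache, vector):
    # One-shot intervention: shift every cached position once; no per-step edits afterwards.
    return prompt_cache + vector

hidden = 16
cache_reasoning = torch.randn(8, hidden)   # cached states from prompts with reasoning traces
cache_plain = torch.randn(8, hidden)       # cached states from the same prompts without traces
prompt_cache = torch.randn(5, hidden)      # (positions, hidden) cache for the current prompt

steered = steer_cache(prompt_cache, steering_vector(cache_reasoning, cache_plain))
print(steered.shape)  # torch.Size([5, 16])
</code></pre>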
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            KV Cache Steering for Inducing Reasoning in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08799v1">http://arxiv.org/abs/2507.08799v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:56:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c2814a7f/ec3cbc89.mp3" length="21932493" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1367</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano</p>

            <p><strong>Title:</strong><br>
            KV Cache Steering for Inducing Reasoning in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.08799v1">http://arxiv.org/abs/2507.08799v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities</title>
      <itunes:episode>967</itunes:episode>
      <podcast:episode>967</podcast:episode>
      <itunes:title>Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">659f9847-cb78-40d6-bd15-87b52b63cc58</guid>
      <link>https://share.transistor.fm/s/2e290f58</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina Xinyi Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D'Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, 
Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D'olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O'Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei Jerry Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug 
DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, ...</p>]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina Xinyi Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D'Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, 
Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D'olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O'Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei Jerry Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug 
DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, ...</p>]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:56:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e290f58/2c269a77.mp3" length="19908379" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1241</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina Xinyi Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D'Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, 
Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D'olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O'Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei Jerry Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug 
DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, ...</p>]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Neural-Driven Image Editing</title>
      <itunes:episode>966</itunes:episode>
      <podcast:episode>966</podcast:episode>
      <itunes:title>Neural-Driven Image Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cce71258-d64a-4fbc-88bb-f1da70ca8d87</guid>
      <link>https://share.transistor.fm/s/1b09ece2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You</p>

            <p><strong>Title:</strong><br>
            Neural-Driven Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05397v1">http://arxiv.org/abs/2507.05397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You</p>

            <p><strong>Title:</strong><br>
            Neural-Driven Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05397v1">http://arxiv.org/abs/2507.05397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Jul 2025 20:55:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b09ece2/90b0a702.mp3" length="19836808" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1236</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You</p>

            <p><strong>Title:</strong><br>
            Neural-Driven Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05397v1">http://arxiv.org/abs/2507.05397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling RL to Long Videos</title>
      <itunes:episode>965</itunes:episode>
      <podcast:episode>965</podcast:episode>
      <itunes:title>Scaling RL to Long Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">61a4e149-fb87-4da9-b44e-b97e2c168323</guid>
      <link>https://share.transistor.fm/s/ab0ab014</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            Scaling RL to Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07966v1">http://arxiv.org/abs/2507.07966v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            Scaling RL to Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07966v1">http://arxiv.org/abs/2507.07966v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:57:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ab0ab014/6fd52888.mp3" length="21885224" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han</p>

            <p><strong>Title:</strong><br>
            Scaling RL to Long Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07966v1">http://arxiv.org/abs/2507.07966v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>T-LoRA: Single Image Diffusion Model Customization Without Overfitting</title>
      <itunes:episode>964</itunes:episode>
      <podcast:episode>964</podcast:episode>
      <itunes:title>T-LoRA: Single Image Diffusion Model Customization Without Overfitting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58ea1b0e-5059-4e09-be6e-cdef7b5539fa</guid>
      <link>https://share.transistor.fm/s/82b0cc37</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            T-LoRA: Single Image Diffusion Model Customization Without Overfitting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05964v1">http://arxiv.org/abs/2507.05964v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            T-LoRA: Single Image Diffusion Model Customization Without Overfitting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05964v1">http://arxiv.org/abs/2507.05964v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:56:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/82b0cc37/a4e13e1a.mp3" length="22380550" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev</p>

            <p><strong>Title:</strong><br>
            T-LoRA: Single Image Diffusion Model Customization Without Overfitting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05964v1">http://arxiv.org/abs/2507.05964v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of T-LoRA in data-limited and resource-constrained scenarios. Code is available at https://github.com/ControlGenAI/T-LoRA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology</title>
      <itunes:episode>963</itunes:episode>
      <podcast:episode>963</podcast:episode>
      <itunes:title>Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ecc8f9cb-59f5-4895-bdff-8381301e1e48</guid>
      <link>https://share.transistor.fm/s/4347e62b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07999v1">http://arxiv.org/abs/2507.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07999v1">http://arxiv.org/abs/2507.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:56:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4347e62b/f9a0bf96.mp3" length="19368748" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07999v1">http://arxiv.org/abs/2507.07999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding</title>
      <itunes:episode>962</itunes:episode>
      <podcast:episode>962</podcast:episode>
      <itunes:title>OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">94fafe70-4b6a-490e-9532-a6d8db31d8b6</guid>
      <link>https://share.transistor.fm/s/729047bd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07984v1">http://arxiv.org/abs/2507.07984v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07984v1">http://arxiv.org/abs/2507.07984v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:56:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/729047bd/356ebe20.mp3" length="22100541" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1378</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07984v1">http://arxiv.org/abs/2507.07984v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs</title>
      <itunes:episode>961</itunes:episode>
      <podcast:episode>961</podcast:episode>
      <itunes:title>Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">57e8622c-4f12-4387-a91c-3da822e82129</guid>
      <link>https://share.transistor.fm/s/5021fe1f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim</p>

            <p><strong>Title:</strong><br>
            Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07990v1">http://arxiv.org/abs/2507.07990v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim</p>

            <p><strong>Title:</strong><br>
            Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07990v1">http://arxiv.org/abs/2507.07990v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:55:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5021fe1f/000be2f3.mp3" length="21937951" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1367</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim</p>

            <p><strong>Title:</strong><br>
            Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07990v1">http://arxiv.org/abs/2507.07990v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling</title>
      <itunes:episode>960</itunes:episode>
      <podcast:episode>960</podcast:episode>
      <itunes:title>Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c72b0c6a-4e00-4b81-9958-1cd31ea41e04</guid>
      <link>https://share.transistor.fm/s/a90ccdd0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07982v1">http://arxiv.org/abs/2507.07982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07982v1">http://arxiv.org/abs/2507.07982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:55:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a90ccdd0/981fcc2e.mp3" length="20061319" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1250</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07982v1">http://arxiv.org/abs/2507.07982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PyVision: Agentic Vision with Dynamic Tooling</title>
      <itunes:episode>959</itunes:episode>
      <podcast:episode>959</podcast:episode>
      <itunes:title>PyVision: Agentic Vision with Dynamic Tooling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">823dd850-a788-49cb-9290-d4d816d423ea</guid>
      <link>https://share.transistor.fm/s/b1e7a753</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei</p>

            <p><strong>Title:</strong><br>
            PyVision: Agentic Vision with Dynamic Tooling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07998v1">http://arxiv.org/abs/2507.07998v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei</p>

            <p><strong>Title:</strong><br>
            PyVision: Agentic Vision with Dynamic Tooling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07998v1">http://arxiv.org/abs/2507.07998v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Jul 2025 20:55:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b1e7a753/6868bc04.mp3" length="18344712" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1143</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei</p>

            <p><strong>Title:</strong><br>
            PyVision: Agentic Vision with Dynamic Tooling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07998v1">http://arxiv.org/abs/2507.07998v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>4KAgent: Agentic Any Image to 4K Super-Resolution</title>
      <itunes:episode>958</itunes:episode>
      <podcast:episode>958</podcast:episode>
      <itunes:title>4KAgent: Agentic Any Image to 4K Super-Resolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f34ee5bd-79bc-4136-b2d9-2fb5c22599e4</guid>
      <link>https://share.transistor.fm/s/036860bb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu</p>

            <p><strong>Title:</strong><br>
            4KAgent: Agentic Any Image to 4K Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07105v1">http://arxiv.org/abs/2507.07105v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-experts policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu</p>

            <p><strong>Title:</strong><br>
            4KAgent: Agentic Any Image to 4K Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07105v1">http://arxiv.org/abs/2507.07105v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-experts policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Jul 2025 21:11:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/036860bb/baee9ff4.mp3" length="25739667" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1605</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu</p>

            <p><strong>Title:</strong><br>
            4KAgent: Agentic Any Image to 4K Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07105v1">http://arxiv.org/abs/2507.07105v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-experts policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: https://4kagent.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data</title>
      <itunes:episode>957</itunes:episode>
      <podcast:episode>957</podcast:episode>
      <itunes:title>Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b0729070-9e37-4d25-9736-094db7add210</guid>
      <link>https://share.transistor.fm/s/2bf01132</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang</p>

            <p><strong>Title:</strong><br>
            Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07095v1">http://arxiv.org/abs/2507.07095v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang</p>

            <p><strong>Title:</strong><br>
            Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07095v1">http://arxiv.org/abs/2507.07095v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Jul 2025 21:10:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2bf01132/8e9b5f61.mp3" length="16177619" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1007</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang</p>

            <p><strong>Title:</strong><br>
            Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07095v1">http://arxiv.org/abs/2507.07095v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited size of training datasets. Moreover, the lack of a comprehensive evaluation framework impedes the advancement of this task by failing to identify directions for improvement. In this work, we aim to push text-to-motion into a new era, that is, to achieve zero-shot generalization. To this end, we first develop an efficient annotation pipeline and introduce MotionMillion, the largest human motion dataset to date, featuring over 2,000 hours and 2 million high-quality motion sequences. Additionally, we propose MotionMillion-Eval, the most comprehensive benchmark for evaluating zero-shot motion generation. Leveraging a scalable architecture, we scale our model to 7B parameters and validate its performance on MotionMillion-Eval. Our results demonstrate strong generalization to out-of-domain and complex compositional motions, marking a significant step toward zero-shot human motion generation. The code is available at https://github.com/VankouF/MotionMillion-Codes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Perception-Aware Policy Optimization for Multimodal Reasoning</title>
      <itunes:episode>956</itunes:episode>
      <podcast:episode>956</podcast:episode>
      <itunes:title>Perception-Aware Policy Optimization for Multimodal Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">532a4695-dd46-468a-971f-0f4ac7b7ec59</guid>
      <link>https://share.transistor.fm/s/3d77cddd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Perception-Aware Policy Optimization for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06448v1">http://arxiv.org/abs/2507.06448v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss, a KL divergence term added to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Perception-Aware Policy Optimization for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06448v1">http://arxiv.org/abs/2507.06448v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss, a KL divergence term added to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Jul 2025 21:10:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3d77cddd/7a43ea30.mp3" length="22587013" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1408</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Perception-Aware Policy Optimization for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06448v1">http://arxiv.org/abs/2507.06448v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss, a KL divergence term added to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MIRIX: Multi-Agent Memory System for LLM-Based Agents</title>
      <itunes:episode>955</itunes:episode>
      <podcast:episode>955</podcast:episode>
      <itunes:title>MIRIX: Multi-Agent Memory System for LLM-Based Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37321aa9-7b5b-4566-b561-13fb6994defd</guid>
      <link>https://share.transistor.fm/s/fd95c0e0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Xi Chen</p>

            <p><strong>Title:</strong><br>
            MIRIX: Multi-Agent Memory System for LLM-Based Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07957v1">http://arxiv.org/abs/2507.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To address this, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Xi Chen</p>

            <p><strong>Title:</strong><br>
            MIRIX: Multi-Agent Memory System for LLM-Based Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07957v1">http://arxiv.org/abs/2507.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To address this, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Jul 2025 21:10:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fd95c0e0/7d25ac9d.mp3" length="20709115" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1291</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Wang, Xi Chen</p>

            <p><strong>Title:</strong><br>
            MIRIX: Multi-Agent Memory System for LLM-Based Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.07957v1">http://arxiv.org/abs/2507.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To address this, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Verification for LLM Code Generation: From Generation to Testing</title>
      <itunes:episode>954</itunes:episode>
      <podcast:episode>954</podcast:episode>
      <itunes:title>Rethinking Verification for LLM Code Generation: From Generation to Testing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ebbf3f1e-fd9e-4acb-8404-945ec95656cd</guid>
      <link>https://share.transistor.fm/s/b06e3493</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Verification for LLM Code Generation: From Generation to Testing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06920v2">http://arxiv.org/abs/2507.06920v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Verification for LLM Code Generation: From Generation to Testing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06920v2">http://arxiv.org/abs/2507.06920v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Jul 2025 21:09:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b06e3493/bdb27b32.mp3" length="21458538" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Verification for LLM Code Generation: From Generation to Testing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06920v2">http://arxiv.org/abs/2507.06920v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SingLoRA: Low Rank Adaptation Using a Single Matrix</title>
      <itunes:episode>953</itunes:episode>
      <podcast:episode>953</podcast:episode>
      <itunes:title>SingLoRA: Low Rank Adaptation Using a Single Matrix</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa959ef0-10dc-4901-9c92-7e1d4e4ea02b</guid>
      <link>https://share.transistor.fm/s/5ed1332a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            SingLoRA: Low Rank Adaptation Using a Single Matrix</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05566v1">http://arxiv.org/abs/2507.05566v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weight update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLaMA 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            SingLoRA: Low Rank Adaptation Using a Single Matrix</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05566v1">http://arxiv.org/abs/2507.05566v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weight update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLaMA 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:03:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5ed1332a/db787328.mp3" length="20680691" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            SingLoRA: Low Rank Adaptation Using a Single Matrix</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05566v1">http://arxiv.org/abs/2507.05566v1</a></p>

            <p><strong>Abstract:</strong><br>
            Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In this paper, we propose SingLoRA, which reformulates low-rank adaptation by learning the weight update as a decomposition of a single low-rank matrix multiplied by its transpose. This simple design inherently removes inter-matrix scale conflicts, ensuring stable optimization, and roughly halves the parameter count. We analyze SingLoRA within the infinite-width neural network framework, showing that it guarantees stable feature learning by construction. Extensive experiments on multiple tasks validate these benefits. In common sense reasoning, fine-tuning LLaMA 7B on MNLI with SingLoRA achieves 91.3% accuracy - surpassing LoRA (89.1%) and LoRA+ (90.2%) - while using only 60% of their parameter budget. In image generation, fine-tuning Stable Diffusion with SingLoRA significantly improves image fidelity on DreamBooth, achieving a DINO similarity score of 0.151, compared to scores of 0.148 and 0.143 for DoRA and LoRA, respectively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey on Latent Reasoning</title>
      <itunes:episode>952</itunes:episode>
      <podcast:episode>952</podcast:episode>
      <itunes:title>A Survey on Latent Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89cae6c8-62d4-44f7-9738-af9f979587be</guid>
      <link>https://share.transistor.fm/s/c48c4e36</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian</p>

            <p><strong>Title:</strong><br>
            A Survey on Latent Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06203v1">http://arxiv.org/abs/2507.06203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian</p>

            <p><strong>Title:</strong><br>
            A Survey on Latent Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06203v1">http://arxiv.org/abs/2507.06203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:02:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c48c4e36/4bb00534.mp3" length="19480290" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1214</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian</p>

            <p><strong>Title:</strong><br>
            A Survey on Latent Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06203v1">http://arxiv.org/abs/2507.06203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model's expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model's continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion</title>
      <itunes:episode>951</itunes:episode>
      <podcast:episode>951</podcast:episode>
      <itunes:title>OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f98faba-586d-4fc7-a721-3a9c30a90380</guid>
      <link>https://share.transistor.fm/s/3829979e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06165v1">http://arxiv.org/abs/2507.06165v1</a></p>

            <p><strong>Abstract:</strong><br>
            The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06165v1">http://arxiv.org/abs/2507.06165v1</a></p>

            <p><strong>Abstract:</strong><br>
            The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:02:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3829979e/68f6ba3f.mp3" length="23136652" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1442</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06165v1">http://arxiv.org/abs/2507.06165v1</a></p>

            <p><strong>Abstract:</strong><br>
            The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How to Train Your LLM Web Agent: A Statistical Diagnosis</title>
      <itunes:episode>950</itunes:episode>
      <podcast:episode>950</podcast:episode>
      <itunes:title>How to Train Your LLM Web Agent: A Statistical Diagnosis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b466860d-26e1-4d8b-9f37-f7f6ae1354e0</guid>
      <link>https://share.transistor.fm/s/b54262dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            How to Train Your LLM Web Agent: A Statistical Diagnosis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04103v1">http://arxiv.org/abs/2507.04103v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            How to Train Your LLM Web Agent: A Statistical Diagnosis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04103v1">http://arxiv.org/abs/2507.04103v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.</p>
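
            <p>As a rough illustration of the bootstrap estimation mentioned above (hypothetical data and metric, not the paper's pipeline), one could resample the sampled configurations to see how reliably one training strategy leads another:</p>

            <pre><code>
# Toy bootstrap over sampled hyperparameter configurations (hypothetical data):
# estimate how often the SFT+RL strategy has a higher mean score than SFT-only.
import random

# pretend results: (uses_rl, score) for each of 1,370 sampled configurations
results = [(random.choice([True, False]), random.random()) for _ in range(1370)]

def bootstrap_win_rate(results, n_boot=1000):
    wins = 0
    for _ in range(n_boot):
        sample = [random.choice(results) for _ in results]   # resample with replacement
        rl = [s for uses_rl, s in sample if uses_rl]
        sft = [s for uses_rl, s in sample if not uses_rl]
        if rl and sft:
            wins += sum(rl) / len(rl) > sum(sft) / len(sft)  # True counts as 1
    return wins / n_boot                                     # fraction of resamples where SFT+RL leads

print(bootstrap_win_rate(results))
</code></pre>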
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:01:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b54262dd/6a57baf0.mp3" length="21849310" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.AI, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            How to Train Your LLM Web Agent: A Statistical Diagnosis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04103v1">http://arxiv.org/abs/2507.04103v1</a></p>

            <p><strong>Abstract:</strong><br>
            LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling</title>
      <itunes:episode>949</itunes:episode>
      <podcast:episode>949</podcast:episode>
      <itunes:title>StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81a88a41-749e-4f80-94fa-dd6ead5e2c33</guid>
      <link>https://share.transistor.fm/s/d5f10e4b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05240v1">http://arxiv.org/abs/2507.05240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05240v1">http://arxiv.org/abs/2507.05240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.</p>
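
            <p>A toy sketch of the slow-fast context idea described above (illustrative names and window size; not the StreamVLN implementation): recent turns stay in a small sliding window, while older turns are compressed into a slowly-updated memory:</p>

            <pre><code>
# Toy slow-fast context: a bounded window of recent dialogue turns ("fast"),
# plus compressed summaries of evicted turns ("slow"). Purely illustrative.
from collections import deque

WINDOW = 8                              # illustrative sliding-window size
fast_context = deque()                  # most recent vision/language/action turns
slow_memory = []                        # compressed summaries of older turns

def add_turn(turn, compress):
    if len(fast_context) == WINDOW:
        oldest = fast_context.popleft()
        slow_memory.append(compress(oldest))   # e.g. prune tokens / summarize
    fast_context.append(turn)

# usage with a trivial "compressor"
for t in range(20):
    add_turn(f"turn-{t}", compress=lambda x: x[:6])
</code></pre>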
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:01:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d5f10e4b/b173be70.mp3" length="21992277" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1371</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.RO, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang</p>

            <p><strong>Title:</strong><br>
            StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05240v1">http://arxiv.org/abs/2507.05240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization</title>
      <itunes:episode>948</itunes:episode>
      <podcast:episode>948</podcast:episode>
      <itunes:title>CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e78774ea-88b0-49df-b464-adfec20024b9</guid>
      <link>https://share.transistor.fm/s/edb04f10</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06181v1">http://arxiv.org/abs/2507.06181v1</a></p>

            <p><strong>Abstract:</strong><br>
            Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase: the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06181v1">http://arxiv.org/abs/2507.06181v1</a></p>

            <p><strong>Abstract:</strong><br>
            Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase: the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:01:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/edb04f10/1a62a44c.mp3" length="23547919" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06181v1">http://arxiv.org/abs/2507.06181v1</a></p>

            <p><strong>Abstract:</strong><br>
            Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has been paid to the critic phase: the evaluation of whether generated formalizations truly capture the semantic intent of the original problem. In this paper, we introduce CriticLean, a novel critic-guided reinforcement learning framework that elevates the role of the critic from a passive validator to an active learning component. Specifically, first, we propose the CriticLeanGPT, trained via supervised fine-tuning and reinforcement learning, to rigorously assess the semantic fidelity of Lean 4 formalizations. Then, we introduce CriticLeanBench, a benchmark designed to measure models' ability to distinguish semantically correct from incorrect formalizations, and demonstrate that our trained CriticLeanGPT models can significantly outperform strong open- and closed-source baselines. Building on the CriticLean framework, we construct FineLeanCorpus, a dataset comprising over 285K problems that exhibits rich domain diversity, broad difficulty coverage, and high correctness based on human evaluation. Overall, our findings highlight that optimizing the critic phase is essential for producing reliable formalizations, and we hope our CriticLean will provide valuable insights for future advances in formal mathematical reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents</title>
      <itunes:episode>947</itunes:episode>
      <podcast:episode>947</podcast:episode>
      <itunes:title>RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0dea0cb3-72a0-4a40-838c-facf5e50be99</guid>
      <link>https://share.transistor.fm/s/75e01b0c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li</p>

            <p><strong>Title:</strong><br>
            RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03112v1">http://arxiv.org/abs/2507.03112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better: moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li</p>

            <p><strong>Title:</strong><br>
            RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03112v1">http://arxiv.org/abs/2507.03112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better: moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:00:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/75e01b0c/3a122849.mp3" length="20528168" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li</p>

            <p><strong>Title:</strong><br>
            RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03112v1">http://arxiv.org/abs/2507.03112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better: moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos</title>
      <itunes:episode>946</itunes:episode>
      <podcast:episode>946</podcast:episode>
      <itunes:title>MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f692e49e-4462-4c4f-9156-d10fec953ccb</guid>
      <link>https://share.transistor.fm/s/3cc9f0dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05675v1">http://arxiv.org/abs/2507.05675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05675v1">http://arxiv.org/abs/2507.05675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Jul 2025 21:00:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3cc9f0dd/72f5d931.mp3" length="18643596" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1162</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rongsheng Wang, Junying Chen, Ke Ji, Zhenyang Cai, Shunian Chen, Yunjin Yang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05675v1">http://arxiv.org/abs/2507.05675v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video generation have shown remarkable progress in open-domain settings, yet medical video generation remains largely underexplored. Medical videos are critical for applications such as clinical training, education, and simulation, requiring not only high visual fidelity but also strict medical accuracy. However, current models often produce unrealistic or erroneous content when applied to medical prompts, largely due to the lack of large-scale, high-quality datasets tailored to the medical domain. To address this gap, we introduce MedVideoCap-55K, the first large-scale, diverse, and caption-rich dataset for medical video generation. It comprises over 55,000 curated clips spanning real-world medical scenarios, providing a strong foundation for training generalist medical video generation models. Built upon this dataset, we develop MedGen, which achieves leading performance among open-source models and rivals commercial systems across multiple benchmarks in both visual quality and medical accuracy. We hope our dataset and model can serve as a valuable resource and help catalyze further research in medical video generation. Our code and data are available at https://github.com/FreedomIntelligence/MedGen.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MemOS: A Memory OS for AI System</title>
      <itunes:episode>945</itunes:episode>
      <podcast:episode>945</podcast:episode>
      <itunes:title>MemOS: A Memory OS for AI System</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0126d397-c2ec-44ab-92c6-ec5d63d38821</guid>
      <link>https://share.transistor.fm/s/e9d6978c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong</p>

            <p><strong>Title:</strong><br>
            MemOS: A Memory OS for AI System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03724v2">http://arxiv.org/abs/2507.03724v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong</p>

            <p><strong>Title:</strong><br>
            MemOS: A Memory OS for AI System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03724v2">http://arxiv.org/abs/2507.03724v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.</p>
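
            <p>A hypothetical sketch of the MemCube notion described above (field and method names are invented for illustration and are not the MemOS API):</p>

            <pre><code>
# Hypothetical illustration of a memory unit bundling content with metadata
# such as provenance and versioning. Not the MemOS library's actual API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemCube:
    content: str            # plaintext, activation-based, or parameter-level payload
    memory_type: str        # e.g. "plaintext", "activation", "parameter"
    provenance: str         # where the memory came from
    version: int = 1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def updated(self, new_content: str) -> "MemCube":
        # return a new version instead of mutating, keeping history traceable
        return MemCube(new_content, self.memory_type, self.provenance, self.version + 1)

cube = MemCube("user prefers concise answers", "plaintext", "chat session 42")
cube_v2 = cube.updated("user prefers concise, bulleted answers")
</code></pre>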
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:51:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9d6978c/dc363408.mp3" length="21007517" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1309</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong</p>

            <p><strong>Title:</strong><br>
            MemOS: A Memory OS for AI System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03724v2">http://arxiv.org/abs/2507.03724v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency. Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations. Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Should We Still Pretrain Encoders with Masked Language Modeling?</title>
      <itunes:episode>944</itunes:episode>
      <podcast:episode>944</podcast:episode>
      <itunes:title>Should We Still Pretrain Encoders with Masked Language Modeling?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">22b40814-736c-403e-bf31-5acdecad9476</guid>
      <link>https://share.transistor.fm/s/d85e1cb0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            Should We Still Pretrain Encoders with Masked Language Modeling?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00994v2">http://arxiv.org/abs/2507.00994v2</a></p>

            <p><strong>Abstract:</strong><br>
            Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.</p>
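
            <p>A minimal sketch of how a biphasic schedule like the one described could be organized: spend the first part of a fixed step budget on causal language modeling, then switch to masked language modeling for the remainder. The step functions are stubs assumed for illustration, not the released training code:</p>

            <pre><code>from typing import Any, Iterator

# Placeholder objective steps; a real run would perform a CLM or MLM
# forward/backward pass here. Stubs keep the sketch self-contained.
def train_clm_step(model: Any, batch: Any) -> None:
    pass  # next-token prediction update

def train_mlm_step(model: Any, batch: Any) -> None:
    pass  # masked-token prediction update

def biphasic_pretrain(model: Any, batches: Iterator, total_steps: int,
                      clm_fraction: float = 0.5) -> Any:
    """Phase 1: CLM, phase 2: MLM, under one fixed step budget."""
    clm_steps = int(total_steps * clm_fraction)
    for _ in range(clm_steps):
        train_clm_step(model, next(batches))
    for _ in range(total_steps - clm_steps):
        train_mlm_step(model, next(batches))
    return model
</code></pre>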
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            Should We Still Pretrain Encoders with Masked Language Modeling?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00994v2">http://arxiv.org/abs/2507.00994v2</a></p>

            <p><strong>Abstract:</strong><br>
            Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:51:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d85e1cb0/5775fc5b.mp3" length="22007725" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo</p>

            <p><strong>Title:</strong><br>
            Should We Still Pretrain Encoders with Masked Language Modeling?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00994v2">http://arxiv.org/abs/2507.00994v2</a></p>

            <p><strong>Abstract:</strong><br>
            Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving</title>
      <itunes:episode>943</itunes:episode>
      <podcast:episode>943</podcast:episode>
      <itunes:title>Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc0bbc54-ef0f-4e80-aac4-6c21302cdf74</guid>
      <link>https://share.transistor.fm/s/1636239c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06229v1">http://arxiv.org/abs/2507.06229v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06229v1">http://arxiv.org/abs/2507.06229v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:51:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1636239c/383949cd.mp3" length="20496392" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1277</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou</p>

            <p><strong>Title:</strong><br>
            Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.06229v1">http://arxiv.org/abs/2507.06229v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other's experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture</title>
      <itunes:episode>942</itunes:episode>
      <podcast:episode>942</podcast:episode>
      <itunes:title>4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e52150f0-690b-4e82-b991-5c778f0629bf</guid>
      <link>https://share.transistor.fm/s/1800ea12</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05163v1">http://arxiv.org/abs/2507.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.</p>
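
            <p>The arithmetic behind the asynchronous capture scheme can be illustrated in a few lines: staggering the start times of N camera groups that each run at a 25 FPS base rate behaves like a single camera at roughly N times that rate. The group counts below are chosen only to reproduce the 100-200 FPS range quoted above and are not the paper's exact configuration:</p>

            <pre><code># Illustration: N camera groups at a shared base rate, started with evenly
# staggered offsets, sample the scene like one camera at N times the rate.
def staggered_offsets(num_groups: int, base_fps: float) -> list:
    frame_interval = 1.0 / base_fps
    return [i * frame_interval / num_groups for i in range(num_groups)]

BASE_FPS = 25.0
for num_groups in (4, 8):
    offsets = staggered_offsets(num_groups, BASE_FPS)
    effective_fps = BASE_FPS * num_groups
    print(num_groups, "groups, start offsets", [round(t, 4) for t in offsets],
          "s, ~%.0f FPS effective" % effective_fps)
# 4 groups gives ~100 FPS and 8 groups ~200 FPS, matching the stated range.
</code></pre>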
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05163v1">http://arxiv.org/abs/2507.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:50:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1800ea12/6750fb88.mp3" length="19395071" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1209</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue</p>

            <p><strong>Title:</strong><br>
            4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05163v1">http://arxiv.org/abs/2507.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge</title>
      <itunes:episode>941</itunes:episode>
      <podcast:episode>941</podcast:episode>
      <itunes:title>DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">830e9ffe-21aa-445a-b56a-f34efd8a648f</guid>
      <link>https://share.transistor.fm/s/7d1f0867</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin</p>

            <p><strong>Title:</strong><br>
            DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04447v1">http://arxiv.org/abs/2507.04447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.</p>
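
            <p>The block-wise structured attention described above can be pictured with a toy mask: attention is allowed within each of the dynamic, spatial, and semantic token groups but blocked across groups, so the three representations stay disentangled. This is one plausible reading of the abstract, sketched with made-up group sizes, not the authors' code:</p>

            <pre><code>import numpy as np

# Toy block-wise attention mask: True means attention is allowed.
# Tokens attend only within their own group, never across groups.
group_sizes = {"dynamic": 4, "spatial": 3, "semantic": 3}
labels = [name for name, size in group_sizes.items() for _ in range(size)]
num_tokens = len(labels)

mask = np.zeros((num_tokens, num_tokens), dtype=bool)
for i in range(num_tokens):
    for j in range(num_tokens):
        mask[i, j] = labels[i] == labels[j]

print(mask.astype(int))  # block-diagonal pattern: within-group attention only
</code></pre>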
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin</p>

            <p><strong>Title:</strong><br>
            DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04447v1">http://arxiv.org/abs/2507.04447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:50:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d1f0867/a65c9aa9.mp3" length="21694692" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin</p>

            <p><strong>Title:</strong><br>
            DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.04447v1">http://arxiv.org/abs/2507.04447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pre-Trained Policy Discriminators are General Reward Models</title>
      <itunes:episode>940</itunes:episode>
      <podcast:episode>940</podcast:episode>
      <itunes:title>Pre-Trained Policy Discriminators are General Reward Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">062a4f8a-564f-4846-bc14-2a2a3561e215</guid>
      <link>https://share.transistor.fm/s/5ec288aa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Pre-Trained Policy Discriminators are General Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05197v1">http://arxiv.org/abs/2507.05197v1</a></p>

            <p><strong>Abstract:</strong><br>
            We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.</p>
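
            <p>The policy-discriminator idea can be caricatured in a few lines: a reward is a score for how close a candidate response is to a reference response drawn from the target policy, so identical policies score high and divergent ones low. The unigram-overlap scorer below is only an illustrative stand-in, not the learned POLAR reward model:</p>

            <pre><code># Toy stand-in for a policy-discriminator reward: score a candidate response
# by its similarity to a reference response sampled from the target policy.
def toy_policy_reward(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    return len(cand & ref) / len(cand | ref)  # Jaccard similarity in [0, 1]

print(toy_policy_reward("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(toy_policy_reward("the cat sat on the mat", "stock prices rose today"))  # 0.0
</code></pre>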
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Pre-Trained Policy Discriminators are General Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05197v1">http://arxiv.org/abs/2507.05197v1</a></p>

            <p><strong>Abstract:</strong><br>
            We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:50:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5ec288aa/622851cc.mp3" length="23260344" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1450</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Pre-Trained Policy Discriminators are General Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.05197v1">http://arxiv.org/abs/2507.05197v1</a></p>

            <p><strong>Abstract:</strong><br>
            We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset</title>
      <itunes:episode>939</itunes:episode>
      <podcast:episode>939</podcast:episode>
      <itunes:title>BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e15e25f2-ebaa-4c9f-8e9c-171039bd18f2</guid>
      <link>https://share.transistor.fm/s/5101023f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03483v2">http://arxiv.org/abs/2507.03483v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, covering diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03483v2">http://arxiv.org/abs/2507.03483v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, covering diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Jul 2025 20:49:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5101023f/cbb6b14e.mp3" length="19226635" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1198</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Philip Torr, Xuanjing Huang</p>

            <p><strong>Title:</strong><br>
            BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.03483v2">http://arxiv.org/abs/2507.03483v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, covering diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebSailor: Navigating Super-human Reasoning for Web Agent</title>
      <itunes:episode>938</itunes:episode>
      <podcast:episode>938</podcast:episode>
      <itunes:title>WebSailor: Navigating Super-human Reasoning for Web Agent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f7caafd-cfba-4477-83df-1c3afd04d1db</guid>
      <link>https://share.transistor.fm/s/5cd0684d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor: Navigating Super-human Reasoning for Web Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02592v1">http://arxiv.org/abs/2507.02592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor: Navigating Super-human Reasoning for Web Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02592v1">http://arxiv.org/abs/2507.02592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:33:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5cd0684d/08d5b1ad.mp3" length="21466879" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            WebSailor: Navigating Super-human Reasoning for Web Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02592v1">http://arxiv.org/abs/2507.02592v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion</title>
      <itunes:episode>937</itunes:episode>
      <podcast:episode>937</podcast:episode>
      <itunes:title>LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a50df13e-ac23-4afa-8afe-5459316d6181</guid>
      <link>https://share.transistor.fm/s/2b731894</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02813v1">http://arxiv.org/abs/2507.02813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02813v1">http://arxiv.org/abs/2507.02813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:33:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2b731894/85076b26.mp3" length="21429717" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1336</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02813v1">http://arxiv.org/abs/2507.02813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback</title>
      <itunes:episode>936</itunes:episode>
      <podcast:episode>936</podcast:episode>
      <itunes:title>Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">72e62132-a022-4e0b-a6b6-3dc91a54bfb2</guid>
      <link>https://share.transistor.fm/s/173b850f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov</p>

            <p><strong>Title:</strong><br>
            Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02321v1">http://arxiv.org/abs/2507.02321v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov</p>

            <p><strong>Title:</strong><br>
            Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02321v1">http://arxiv.org/abs/2507.02321v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:33:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/173b850f/7e9c4324.mp3" length="18548301" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1156</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov</p>

            <p><strong>Title:</strong><br>
            Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02321v1">http://arxiv.org/abs/2507.02321v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction</title>
      <itunes:episode>935</itunes:episode>
      <podcast:episode>935</podcast:episode>
      <itunes:title>IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">88208e73-bf4c-4b4f-bade-53b4310a3322</guid>
      <link>https://share.transistor.fm/s/b93026f0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | q-bio.BM</p>

            <p><strong>Authors:</strong><br>
            The IntFold Team, Leon Qiao, Wayne Bai, He Yan, Gary Liu, Nova Xi, Xiang Zhang</p>

            <p><strong>Title:</strong><br>
            IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02025v1">http://arxiv.org/abs/2507.02025v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce IntFold, a controllable foundation model for both general and specialized biomolecular structure prediction. IntFold demonstrates predictive accuracy comparable to the state-of-the-art AlphaFold3, while utilizing a superior customized attention kernel. Beyond standard structure prediction, IntFold can be adapted to predict allosteric states, constrained structures, and binding affinity through the use of individual adapters. Furthermore, we introduce a novel confidence head to estimate docking quality, offering a more nuanced assessment for challenging targets such as antibody-antigen complexes. Finally, we share insights gained during the training process of this computationally intensive model.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | q-bio.BM</p>

            <p><strong>Authors:</strong><br>
            The IntFold Team, Leon Qiao, Wayne Bai, He Yan, Gary Liu, Nova Xi, Xiang Zhang</p>

            <p><strong>Title:</strong><br>
            IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02025v1">http://arxiv.org/abs/2507.02025v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce IntFold, a controllable foundation model for both general and specialized biomolecular structure prediction. IntFold demonstrates predictive accuracy comparable to the state-of-the-art AlphaFold3, while utilizing a superior customized attention kernel. Beyond standard structure prediction, IntFold can be adapted to predict allosteric states, constrained structures, and binding affinity through the use of individual adapters. Furthermore, we introduce a novel confidence head to estimate docking quality, offering a more nuanced assessment for challenging targets such as antibody-antigen complexes. Finally, we share insights gained during the training process of this computationally intensive model.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:32:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b93026f0/d54a8192.mp3" length="21087835" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | q-bio.BM</p>

            <p><strong>Authors:</strong><br>
            The IntFold Team, Leon Qiao, Wayne Bai, He Yan, Gary Liu, Nova Xi, Xiang Zhang</p>

            <p><strong>Title:</strong><br>
            IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.02025v1">http://arxiv.org/abs/2507.02025v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce IntFold, a controllable foundation model for both general and specialized biomolecular structure prediction. IntFold demonstrates predictive accuracy comparable to the state-of-the-art AlphaFold3, while utilizing a superior customized attention kernel. Beyond standard structure prediction, IntFold can be adapted to predict allosteric states, constrained structures, and binding affinity through the use of individual adapters. Furthermore, we introduce a novel confidence head to estimate docking quality, offering a more nuanced assessment for challenging targets such as antibody-antigen complexes. Finally, we share insights gained during the training process of this computationally intensive model.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy</title>
      <itunes:episode>934</itunes:episode>
      <podcast:episode>934</podcast:episode>
      <itunes:title>Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">276ccec0-476d-4c50-a86a-db625e399dbd</guid>
      <link>https://share.transistor.fm/s/84ed10ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01352v2">http://arxiv.org/abs/2507.01352v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01352v2">http://arxiv.org/abs/2507.01352v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:32:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/84ed10ee/ebfda7cf.mp3" length="20681130" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01352v2">http://arxiv.org/abs/2507.01352v2</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers</title>
      <itunes:episode>933</itunes:episode>
      <podcast:episode>933</podcast:episode>
      <itunes:title>Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5982c0c1-f4c9-4cfd-b4f3-4895637bb641</guid>
      <link>https://share.transistor.fm/s/920a26ca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung</p>

            <p><strong>Title:</strong><br>
            Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23918v3">http://arxiv.org/abs/2506.23918v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung</p>

            <p><strong>Title:</strong><br>
            Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23918v3">http://arxiv.org/abs/2506.23918v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Jul 2025 20:31:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/920a26ca/b38a01ef.mp3" length="19327796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1204</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, Yi R. Fung</p>

            <p><strong>Title:</strong><br>
            Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23918v3">http://arxiv.org/abs/2506.23918v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the think with image paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kwai Keye-VL Technical Report</title>
      <itunes:episode>932</itunes:episode>
      <podcast:episode>932</podcast:episode>
      <itunes:title>Kwai Keye-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12fff396-7b9a-4b91-a2c6-cdaa12b55e21</guid>
      <link>https://share.transistor.fm/s/fb9f53cf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01949v1">http://arxiv.org/abs/2507.01949v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce <strong>Kwai Keye-VL</strong>, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the <strong>KC-MMBench</strong>, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01949v1">http://arxiv.org/abs/2507.01949v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce <strong>Kwai Keye-VL</strong>, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the <strong>KC-MMBench</strong>, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Jul 2025 20:14:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb9f53cf/6261eca3.mp3" length="21822116" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1360</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo, Jing Wang, Lejian Ren, Muhao Wei, Qianqian Wang, Qigen Hu, Shiyao Wang, Tao Yu, Xinchen Luo, Yan Li, Yiming Liang, Yuhang Hu, Zeyi Lu, Zhuoran Yang, Zixing Zhang</p>

            <p><strong>Title:</strong><br>
            Kwai Keye-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01949v1">http://arxiv.org/abs/2507.01949v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce <strong>Kwai Keye-VL</strong>, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the <strong>KC-MMBench</strong>, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongAnimation: Long Animation Generation with Dynamic Global-Local Memory</title>
      <itunes:episode>931</itunes:episode>
      <podcast:episode>931</podcast:episode>
      <itunes:title>LongAnimation: Long Animation Generation with Dynamic Global-Local Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">87bbbf07-b30a-4bd1-bd08-055124d9ffbf</guid>
      <link>https://share.transistor.fm/s/eb40aa6a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            LongAnimation: Long Animation Generation with Dynamic Global-Local Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01945v1">http://arxiv.org/abs/2507.01945v1</a></p>

            <p><strong>Abstract:</strong><br>
            Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for the open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            LongAnimation: Long Animation Generation with Dynamic Global-Local Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01945v1">http://arxiv.org/abs/2507.01945v1</a></p>

            <p><strong>Abstract:</strong><br>
            Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for the open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Jul 2025 20:14:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eb40aa6a/5742ead8.mp3" length="21132109" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nan Chen, Mengqi Huang, Yihao Meng, Zhendong Mao</p>

            <p><strong>Title:</strong><br>
            LongAnimation: Long Animation Generation with Dynamic Global-Local Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01945v1">http://arxiv.org/abs/2507.01945v1</a></p>

            <p><strong>Abstract:</strong><br>
            Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for the open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Depth Anything at Any Condition</title>
      <itunes:episode>930</itunes:episode>
      <podcast:episode>930</podcast:episode>
      <itunes:title>Depth Anything at Any Condition</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4250fa35-d7d6-430f-8992-86f4afb359a9</guid>
      <link>https://share.transistor.fm/s/30f39dc8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            Depth Anything at Any Condition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01634v1">http://arxiv.org/abs/2507.01634v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.   Project Page: https://ghost233lism.github.io/depthanything-AC-page   Code: https://github.com/HVision-NKU/DepthAnythingAC</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            Depth Anything at Any Condition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01634v1">http://arxiv.org/abs/2507.01634v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.   Project Page: https://ghost233lism.github.io/depthanything-AC-page   Code: https://github.com/HVision-NKU/DepthAnythingAC</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Jul 2025 20:14:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/30f39dc8/ba27f68b.mp3" length="24301870" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1515</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            Depth Anything at Any Condition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01634v1">http://arxiv.org/abs/2507.01634v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability of generating high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks.   Project Page: https://ghost233lism.github.io/depthanything-AC-page   Code: https://github.com/HVision-NKU/DepthAnythingAC</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey on Vision-Language-Action Models: An Action Tokenization Perspective</title>
      <itunes:episode>929</itunes:episode>
      <podcast:episode>929</podcast:episode>
      <itunes:title>A Survey on Vision-Language-Action Models: An Action Tokenization Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">130ddc35-2453-4444-911b-affa7c5f7108</guid>
      <link>https://share.transistor.fm/s/156cd32d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang</p>

            <p><strong>Title:</strong><br>
            A Survey on Vision-Language-Action Models: An Action Tokenization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01925v1">http://arxiv.org/abs/2507.01925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang</p>

            <p><strong>Title:</strong><br>
            A Survey on Vision-Language-Action Models: An Action Tokenization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01925v1">http://arxiv.org/abs/2507.01925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Jul 2025 20:13:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/156cd32d/984a94a9.mp3" length="23015020" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1435</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang</p>

            <p><strong>Title:</strong><br>
            A Survey on Vision-Language-Action Models: An Action Tokenization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01925v1">http://arxiv.org/abs/2507.01925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable advancements of vision and language foundation models in multimodal understanding, reasoning, and generation have sparked growing efforts to extend such intelligence to the physical world, fueling the flourishing of vision-language-action (VLA) models. Despite seemingly diverse approaches, we observe that current VLA models can be unified under a single framework: vision and language inputs are processed by a series of VLA modules, producing a chain of \textit{action tokens} that progressively encode more grounded and actionable information, ultimately generating executable actions. We further determine that the primary design choice distinguishing VLA models lies in how action tokens are formulated, which can be categorized into language description, code, affordance, trajectory, goal state, latent representation, raw action, and reasoning. However, there remains a lack of comprehensive understanding regarding action tokens, significantly impeding effective VLA development and obscuring future directions. Therefore, this survey aims to categorize and interpret existing VLA research through the lens of action tokenization, distill the strengths and limitations of each token type, and identify areas for improvement. Through this systematic review and analysis, we offer a synthesized outlook on the broader evolution of VLA models, highlight underexplored yet promising directions, and contribute guidance for future research, hoping to bring the field closer to general-purpose intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</title>
      <itunes:episode>928</itunes:episode>
      <podcast:episode>928</podcast:episode>
      <itunes:title>GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">99e3a602-5064-4606-8915-b61f6503fe89</guid>
      <link>https://share.transistor.fm/s/45db6ada</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 141 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01006v2">http://arxiv.org/abs/2507.01006v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 141 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01006v2">http://arxiv.org/abs/2507.01006v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 02 Jul 2025 20:30:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/45db6ada/d434fd96.mp3" length="23848865" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1487</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 141 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Wenkai Li, Wei Jia, Xin Lyu, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuxuan Zhang, Zhanxiao Du, Zhenyu Hou, Zhao Xue, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01006v2">http://arxiv.org/abs/2507.01006v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning</title>
      <itunes:episode>927</itunes:episode>
      <podcast:episode>927</podcast:episode>
      <itunes:title>Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2187ebaa-de80-474c-ba32-acd958ff96bb</guid>
      <link>https://share.transistor.fm/s/bcea2414</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00432v1">http://arxiv.org/abs/2507.00432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00432v1">http://arxiv.org/abs/2507.00432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 02 Jul 2025 20:29:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bcea2414/474435d6.mp3" length="21017616" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1310</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.00432v1">http://arxiv.org/abs/2507.00432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks</title>
      <itunes:episode>926</itunes:episode>
      <podcast:episode>926</podcast:episode>
      <itunes:title>SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">71e946a5-9746-4ef8-a644-d93b634e29b0</guid>
      <link>https://share.transistor.fm/s/41055166</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01001v1">http://arxiv.org/abs/2507.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01001v1">http://arxiv.org/abs/2507.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 02 Jul 2025 20:29:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/41055166/d70c64f4.mp3" length="18248208" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1137</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Charles McGrady, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2507.01001v1">http://arxiv.org/abs/2507.01001v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 23 open-source and proprietary foundation models and has collected over 13,000 votes from trusted researchers across diverse scientific domains. We analyze the data collected so far and confirm that the submitted questions are diverse, aligned with real-world literature needs, and that participating researchers demonstrate strong self-consistency and inter-annotator agreement in their evaluations. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data. The benchmark measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings</title>
      <itunes:episode>925</itunes:episode>
      <podcast:episode>925</podcast:episode>
      <itunes:title>MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e32f5381-b324-4db1-99f1-7b8c0a3ab8f8</guid>
      <link>https://share.transistor.fm/s/6e6d7ec6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23115v1">http://arxiv.org/abs/2506.23115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23115v1">http://arxiv.org/abs/2506.23115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 02 Jul 2025 20:29:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6e6d7ec6/df3c99a3.mp3" length="20186705" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1258</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23115v1">http://arxiv.org/abs/2506.23115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation</title>
      <itunes:episode>924</itunes:episode>
      <podcast:episode>924</podcast:episode>
      <itunes:title>Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">67144d66-601a-47e2-b3ca-bfe7204f956e</guid>
      <link>https://share.transistor.fm/s/49fff2a7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han</p>

            <p><strong>Title:</strong><br>
            Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.19852v1">http://arxiv.org/abs/2506.19852v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han</p>

            <p><strong>Title:</strong><br>
            Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.19852v1">http://arxiv.org/abs/2506.19852v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 02 Jul 2025 20:28:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/49fff2a7/8da6a7c6.mp3" length="20110635" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1253</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han</p>

            <p><strong>Title:</strong><br>
            Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.19852v1">http://arxiv.org/abs/2506.19852v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ovis-U1 Technical Report</title>
      <itunes:episode>923</itunes:episode>
      <podcast:episode>923</podcast:episode>
      <itunes:title>Ovis-U1 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be7b69e9-8eca-4140-8cb1-935329931ae3</guid>
      <link>https://share.transistor.fm/s/2958bdb7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen</p>

            <p><strong>Title:</strong><br>
            Ovis-U1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23044v2">http://arxiv.org/abs/2506.23044v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen</p>

            <p><strong>Title:</strong><br>
            Ovis-U1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23044v2">http://arxiv.org/abs/2506.23044v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 01 Jul 2025 20:19:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2958bdb7/7238f731.mp3" length="21426304" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen</p>

            <p><strong>Title:</strong><br>
            Ovis-U1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23044v2">http://arxiv.org/abs/2506.23044v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</title>
      <itunes:episode>922</itunes:episode>
      <podcast:episode>922</podcast:episode>
      <itunes:title>SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">548d7ccc-03bc-4e1a-b78e-732dbebeca9d</guid>
      <link>https://share.transistor.fm/s/615c3e89</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques</p>

            <p><strong>Title:</strong><br>
            SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24119v2">http://arxiv.org/abs/2506.24119v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques</p>

            <p><strong>Title:</strong><br>
            SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24119v2">http://arxiv.org/abs/2506.24119v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 01 Jul 2025 20:18:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/615c3e89/b6e5b2a7.mp3" length="19872833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1238</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques</p>

            <p><strong>Title:</strong><br>
            SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24119v2">http://arxiv.org/abs/2506.24119v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VMoBA: Mixture-of-Block Attention for Video Diffusion Models</title>
      <itunes:episode>921</itunes:episode>
      <podcast:episode>921</podcast:episode>
      <itunes:title>VMoBA: Mixture-of-Block Attention for Video Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e43ae35-be5d-407d-8aca-b42bd1761793</guid>
      <link>https://share.transistor.fm/s/94fef0b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            VMoBA: Mixture-of-Block Attention for Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23858v1">http://arxiv.org/abs/2506.23858v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.</p>
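
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The threshold-based block selection described above chooses how many key blocks to attend to from their cumulative similarity. A small sketch of that idea follows, assuming per-block scores are normalized with a softmax and blocks are kept until their cumulative mass reaches the threshold; the scoring and normalization choices are assumptions, not the paper's exact procedure.</p>

            <pre><code># Sketch of threshold-based block selection: rank key blocks by similarity,
# normalize the scores, and keep blocks until their cumulative mass reaches
# the threshold.
import numpy as np

def select_blocks(block_scores, threshold=0.9):
    """block_scores: 1D array of per-block similarity scores (higher = more relevant).
    Returns indices of the blocks whose cumulative softmax mass reaches `threshold`."""
    probs = np.exp(block_scores - block_scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # most similar blocks first
    cumulative = np.cumsum(probs[order])
    keep = int(np.searchsorted(cumulative, threshold)) + 1
    return np.sort(order[:keep])

scores = np.array([2.0, 0.1, 1.5, -0.5, 0.8])
print(select_blocks(scores, threshold=0.9))    # only the dominant blocks are kept
</code></pre>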
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            VMoBA: Mixture-of-Block Attention for Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23858v1">http://arxiv.org/abs/2506.23858v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 01 Jul 2025 20:18:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94fef0b8/1ce5f037.mp3" length="16485644" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1027</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            VMoBA: Mixture-of-Block Attention for Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.23858v1">http://arxiv.org/abs/2506.23858v1</a></p>

            <p><strong>Abstract:</strong><br>
            The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Calligrapher: Freestyle Text Image Customization</title>
      <itunes:episode>920</itunes:episode>
      <podcast:episode>920</podcast:episode>
      <itunes:title>Calligrapher: Freestyle Text Image Customization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">356bf4e2-e338-4742-9f9f-6b8fb855f3d8</guid>
      <link>https://share.transistor.fm/s/460221ad</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Calligrapher: Freestyle Text Image Customization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24123v1">http://arxiv.org/abs/2506.24123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Calligrapher: Freestyle Text Image Customization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24123v1">http://arxiv.org/abs/2506.24123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 01 Jul 2025 20:17:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/460221ad/5c1252aa.mp3" length="21714302" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Calligrapher: Freestyle Text Image Customization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.24123v1">http://arxiv.org/abs/2506.24123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing</title>
      <itunes:episode>919</itunes:episode>
      <podcast:episode>919</podcast:episode>
      <itunes:title>BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">48676a9f-6271-4f98-aa9b-686fcd16d7db</guid>
      <link>https://share.transistor.fm/s/962bfc60</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo</p>

            <p><strong>Title:</strong><br>
            BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.17450v2">http://arxiv.org/abs/2506.17450v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo</p>

            <p><strong>Title:</strong><br>
            BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.17450v2">http://arxiv.org/abs/2506.17450v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Jun 2025 20:20:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/962bfc60/b825f9aa.mp3" length="21491131" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Chen, Ramin Mehran, Xuhui Jia, Saining Xie, Sanghyun Woo</p>

            <p><strong>Title:</strong><br>
            BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.17450v2">http://arxiv.org/abs/2506.17450v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs</title>
      <itunes:episode>918</itunes:episode>
      <podcast:episode>918</podcast:episode>
      <itunes:title>LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f391fd0f-9482-4a08-82b7-7874cc684ef1</guid>
      <link>https://share.transistor.fm/s/2ed6b895</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21862v1">http://arxiv.org/abs/2506.21862v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Instead, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.</p>
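
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Compression with semantic connected components can be pictured as: build a token-similarity graph, threshold the edges, take connected components, and represent each component by one averaged token. The sketch below follows that picture; the cosine-similarity threshold and the mean-pooling step are assumptions for illustration, not the paper's exact settings.</p>

            <pre><code># Sketch of connected-component token compression: tokens whose cosine
# similarity exceeds a threshold are linked, and each connected component
# is replaced by a single averaged token.
import numpy as np

def compress_tokens(tokens, sim_threshold=0.8):
    """tokens: (N, D) array of token features. Returns (K, D) merged tokens."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    adjacency = normed @ normed.T >= sim_threshold

    n = len(tokens)
    component = [-1] * n
    current = 0
    for start in range(n):
        if component[start] != -1:
            continue
        stack = [start]                        # depth-first search over the graph
        component[start] = current
        while stack:
            node = stack.pop()
            for other in range(n):
                if component[other] == -1 and adjacency[node, other]:
                    component[other] = current
                    stack.append(other)
        current += 1

    labels = np.array(component)
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(current)])

tokens = np.random.randn(16, 8)
print(compress_tokens(tokens).shape)           # (K, 8), K = number of components
</code></pre>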
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21862v1">http://arxiv.org/abs/2506.21862v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Instead, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Jun 2025 20:19:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ed6b895/55e1b0ae.mp3" length="20126927" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou</p>

            <p><strong>Title:</strong><br>
            LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21862v1">http://arxiv.org/abs/2506.21862v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly attempt to compress tokens based on attention scores, but fail to effectively capture all semantic regions and often lead to token redundancy. Instead, we propose to leverage the Semantic Connected Components (SCC) approach that assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that utilizes SCC in both spatial and temporal domains. This strategy can effectively compress tokens by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that the proposed LLaVA-Scissor outperforms other token compression methods, achieving superior performance in various video understanding benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation</title>
      <itunes:episode>917</itunes:episode>
      <podcast:episode>917</podcast:episode>
      <itunes:title>XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4792fcde-9a86-456d-8e35-db9b12725d1d</guid>
      <link>https://share.transistor.fm/s/5462a6b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21416v1">http://arxiv.org/abs/2506.21416v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.</p>
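
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            One way to read "reference images become offsets for token-specific text-stream modulation" is that a reference-image embedding is projected to a shift that is added to the DiT's text-stream modulation only at that subject's token positions. The sketch below encodes that reading; the shapes, the projection, and the token indices are illustrative assumptions.</p>

            <pre><code># Sketch: project a reference-image feature to an offset and add it to the
# text-stream modulation only at the tokens bound to that subject.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 10, 16

base_modulation = rng.normal(size=(num_tokens, dim))   # modulation from the text stream
ref_feature = rng.normal(size=(dim,))                  # encoded reference image (stand-in)
proj = rng.normal(size=(dim, dim)) * 0.05              # learned projection (stand-in)

subject_token_ids = [3, 4]                             # tokens describing this subject
offset = ref_feature @ proj

modulation = base_modulation.copy()
modulation[subject_token_ids] += offset                # token-specific injection only

print(np.abs(modulation - base_modulation).sum(axis=1))  # nonzero only at tokens 3 and 4
</code></pre>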
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21416v1">http://arxiv.org/abs/2506.21416v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Jun 2025 20:19:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5462a6b2/5d4b65ed.mp3" length="22830718" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1423</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.21416v1">http://arxiv.org/abs/2506.21416v1</a></p>

            <p><strong>Abstract:</strong><br>
            Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose a novel multi-subject controlled generation model XVerse. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback</title>
      <itunes:episode>916</itunes:episode>
      <podcast:episode>916</podcast:episode>
      <itunes:title>Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e9786ab4-eafb-4d72-b0dd-3dad22537a25</guid>
      <link>https://share.transistor.fm/s/0633b9d0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi</p>

            <p><strong>Title:</strong><br>
            Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11930v1">http://arxiv.org/abs/2506.11930v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.</p>
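
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The evaluation pipeline described above (a solver attempts, a feedback generator with near-complete ground truth responds, the solver retries) has a simple control flow. In the sketch below, solve and give_feedback are toy stand-ins for the model calls, so only the loop structure reflects the description.</p>

            <pre><code># Sketch of the solve -> feedback -> retry loop. `solve` and `give_feedback`
# stand in for LLM calls; here they are toy functions so the loop can run.
def solve(problem, feedback=None):
    # A real system would call the solver model with the problem and feedback.
    return problem["attempts"].pop(0) if problem["attempts"] else problem["answer"]

def give_feedback(problem, attempt):
    # A real system would let a feedback model compare against ground truth.
    if attempt == problem["answer"]:
        return None
    return f"{attempt} is incorrect; re-check the final step."

def run_with_feedback(problem, max_rounds=5):
    feedback = None
    for round_idx in range(max_rounds):
        attempt = solve(problem, feedback)
        feedback = give_feedback(problem, attempt)
        if feedback is None:
            return attempt, round_idx + 1
    return attempt, max_rounds

problem = {"answer": 42, "attempts": [41, 40]}  # toy problem with two wrong tries
print(run_with_feedback(problem))               # -> (42, 3)
</code></pre>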
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi</p>

            <p><strong>Title:</strong><br>
            Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11930v1">http://arxiv.org/abs/2506.11930v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Jun 2025 23:52:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0633b9d0/479a31d6.mp3" length="24067435" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1501</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi</p>

            <p><strong>Title:</strong><br>
            Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11930v1">http://arxiv.org/abs/2506.11930v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Effective Red-Teaming of Policy-Adherent Agents</title>
      <itunes:episode>915</itunes:episode>
      <podcast:episode>915</podcast:episode>
      <itunes:title>Effective Red-Teaming of Policy-Adherent Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">16ed401f-64f1-4a97-b4d4-fec58c70c8e5</guid>
      <link>https://share.transistor.fm/s/2b13536d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.MA, cs.AI, cs.CL, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor</p>

            <p><strong>Title:</strong><br>
            Effective Red-Teaming of Policy-Adherent Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09600v1">http://arxiv.org/abs/2506.09600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive tactics. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.MA, cs.AI, cs.CL, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor</p>

            <p><strong>Title:</strong><br>
            Effective Red-Teaming of Policy-Adherent Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09600v1">http://arxiv.org/abs/2506.09600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive tactics. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Jun 2025 23:51:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2b13536d/7856a695.mp3" length="18852534" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1175</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.MA, cs.AI, cs.CL, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Itay Nakash, George Kour, Koren Lazar, Matan Vetzler, Guy Uziel, Ateret Anaby-Tavor</p>

            <p><strong>Title:</strong><br>
            Effective Red-Teaming of Policy-Adherent Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09600v1">http://arxiv.org/abs/2506.09600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. The challenge lies in ensuring that the agent consistently adheres to these rules and policies, appropriately refusing any request that would violate them, while still maintaining a helpful and natural interaction. This calls for the development of tailored design and evaluation methodologies to ensure agent resilience against malicious user behavior. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. To address this, we present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario, outperforming conventional jailbreak methods such as DAN prompts, emotional manipulation, and coercive tactics. Building upon the existing tau-bench benchmark, we introduce tau-break, a complementary benchmark designed to rigorously assess the agent's robustness against manipulative user behavior. Finally, we evaluate several straightforward yet effective defense strategies. While these measures provide some protection, they fall short, highlighting the need for stronger, research-driven safeguards to protect policy-adherent agents from adversarial attacks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation</title>
      <itunes:episode>914</itunes:episode>
      <podcast:episode>914</podcast:episode>
      <itunes:title>Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0a966312-0572-4b5e-8871-47645bcf3a87</guid>
      <link>https://share.transistor.fm/s/e94a7648</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11924v1">http://arxiv.org/abs/2506.11924v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.</p>
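
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Cross-modal attention distillation is described as injecting attention maps from the image diffusion branch into the parallel geometry branch. The sketch below computes attention probabilities once from the image branch's queries and keys and reuses them to mix the geometry branch's values; the single-head setup and shapes are simplifications, not the paper's implementation.</p>

            <pre><code># Sketch: compute attention weights in the image branch, then reuse those
# weights to aggregate values in the geometry branch (single head).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, dim = 6, 8
q_img = rng.normal(size=(seq, dim))          # image-branch queries
k_img = rng.normal(size=(seq, dim))          # image-branch keys
v_geo = rng.normal(size=(seq, dim))          # geometry-branch values

attn_img = softmax(q_img @ k_img.T / np.sqrt(dim))   # image-branch attention map
out_geo = attn_img @ v_geo                           # injected into the geometry branch

print(out_geo.shape)   # (6, 8): geometry features mixed with image-branch attention
</code></pre>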
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11924v1">http://arxiv.org/abs/2506.11924v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Jun 2025 23:51:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e94a7648/63f82783.mp3" length="20648126" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim</p>

            <p><strong>Title:</strong><br>
            Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.11924v1">http://arxiv.org/abs/2506.11924v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning</title>
      <itunes:episode>913</itunes:episode>
      <podcast:episode>913</podcast:episode>
      <itunes:title>ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">19568328-be6f-4e37-a064-6352bfd23953</guid>
      <link>https://share.transistor.fm/s/cf379d02</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu</p>

            <p><strong>Title:</strong><br>
            ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09513v1">http://arxiv.org/abs/2506.09513v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a <em>multi-agent verification and refinement process</em>, where we design an <em>Error Refiner</em> to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu</p>

            <p><strong>Title:</strong><br>
            ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09513v1">http://arxiv.org/abs/2506.09513v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a <em>multi-agent verification and refinement process</em>, where we design an <em>Error Refiner</em> to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:56:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cf379d02/40c4356b.mp3" length="21085304" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL, cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu</p>

            <p><strong>Title:</strong><br>
            ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09513v1">http://arxiv.org/abs/2506.09513v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a <em>multi-agent verification and refinement process</em>, where we design an <em>Error Refiner</em> to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks</title>
      <itunes:episode>912</itunes:episode>
      <podcast:episode>912</podcast:episode>
      <itunes:title>SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ab74d51-a1e2-4ec6-9bbd-4e5e15254630</guid>
      <link>https://share.transistor.fm/s/65c4fdc6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng</p>

            <p><strong>Title:</strong><br>
            SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10954v1">http://arxiv.org/abs/2506.10954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng</p>

            <p><strong>Title:</strong><br>
            SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10954v1">http://arxiv.org/abs/2506.10954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:56:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65c4fdc6/2e4514fd.mp3" length="21239966" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1324</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng</p>

            <p><strong>Title:</strong><br>
            SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10954v1">http://arxiv.org/abs/2506.10954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges. To tackle these issues, our pipeline integrates three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction, which employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Text-Aware Image Restoration with Diffusion Models</title>
      <itunes:episode>911</itunes:episode>
      <podcast:episode>911</podcast:episode>
      <itunes:title>Text-Aware Image Restoration with Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ceb4ed6b-9488-4e11-a555-ebceaa479feb</guid>
      <link>https://share.transistor.fm/s/0db591a0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Text-Aware Image Restoration with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09993v1">http://arxiv.org/abs/2506.09993v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Text-Aware Image Restoration with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09993v1">http://arxiv.org/abs/2506.09993v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:55:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0db591a0/242fdeb1.mp3" length="22838614" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            Text-Aware Image Restoration with Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09993v1">http://arxiv.org/abs/2506.09993v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation</title>
      <itunes:episode>910</itunes:episode>
      <podcast:episode>910</podcast:episode>
      <itunes:title>AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b32c938c-bbcd-4f09-b0c3-fcfd68aaa7ea</guid>
      <link>https://share.transistor.fm/s/10d8f9f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.MA, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10540v1">http://arxiv.org/abs/2506.10540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.MA, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10540v1">http://arxiv.org/abs/2506.10540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:55:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/10d8f9f1/465fed36.mp3" length="19409295" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1209</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.MA, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10540v1">http://arxiv.org/abs/2506.10540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</title>
      <itunes:episode>909</itunes:episode>
      <podcast:episode>909</podcast:episode>
      <itunes:title>VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ed22702-3cba-4c9a-989f-e1e9941b25fa</guid>
      <link>https://share.transistor.fm/s/38a6d7c0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10857v1">http://arxiv.org/abs/2506.10857v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10857v1">http://arxiv.org/abs/2506.10857v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:55:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38a6d7c0/5f5a61ee.mp3" length="21259584" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10857v1">http://arxiv.org/abs/2506.10857v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (with an average duration of 1.6 hours), along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps with timestamps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. Apart from the MCQs for the final results, we propose a progress-level LLM-guided scoring metric to evaluate the quality of the reasoning chain from multiple dimensions comprehensively. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Discrete Audio Tokens: More Than a Survey!</title>
      <itunes:episode>908</itunes:episode>
      <podcast:episode>908</podcast:episode>
      <itunes:title>Discrete Audio Tokens: More Than a Survey!</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c291bddd-6b90-4d06-8dc2-5eceb5c06a3d</guid>
      <link>https://share.transistor.fm/s/b9df0f5e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli</p>

            <p><strong>Title:</strong><br>
            Discrete Audio Tokens: More Than a Survey!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10274v1">http://arxiv.org/abs/2506.10274v1</a></p>

            <p><strong>Abstract:</strong><br>
            Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli</p>

            <p><strong>Title:</strong><br>
            Discrete Audio Tokens: More Than a Survey!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10274v1">http://arxiv.org/abs/2506.10274v1</a></p>

            <p><strong>Abstract:</strong><br>
            Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.</p>
            ]]>
      </content:encoded>
      <pubDate>Sat, 14 Jun 2025 10:54:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9df0f5e/679e9a4e.mp3" length="23903984" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1490</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli</p>

            <p><strong>Title:</strong><br>
            Discrete Audio Tokens: More Than a Survey!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.10274v1">http://arxiv.org/abs/2506.10274v1</a></p>

            <p><strong>Abstract:</strong><br>
            Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models</title>
      <itunes:episode>907</itunes:episode>
      <podcast:episode>907</podcast:episode>
      <itunes:title>Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0819bf23-adad-4e1e-936e-95c69ee15792</guid>
      <link>https://share.transistor.fm/s/380dbd52</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06395v3">http://arxiv.org/abs/2506.06395v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06395v3">http://arxiv.org/abs/2506.06395v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:43:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/380dbd52/6ea53fc4.mp3" length="19886588" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1239</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06395v3">http://arxiv.org/abs/2506.06395v3</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and unlabelled supervision.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seedance 1.0: Exploring the Boundaries of Video Generation Models</title>
      <itunes:episode>906</itunes:episode>
      <podcast:episode>906</podcast:episode>
      <itunes:title>Seedance 1.0: Exploring the Boundaries of Video Generation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">54eebc22-c6e0-426f-b04e-35d66759eb9d</guid>
      <link>https://share.transistor.fm/s/b2cd54ac</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.0: Exploring the Boundaries of Video Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09113v1">http://arxiv.org/abs/2506.09113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; and (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.0: Exploring the Boundaries of Video Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09113v1">http://arxiv.org/abs/2506.09113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; and (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:42:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2cd54ac/41458bb1.mp3" length="19961398" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, Jianchao Yang, Runkai Yang, Tao Yang, Yihang Yang, Zilyu Ye, Xuejiao Zeng, Yan Zeng, Heng Zhang, Yang Zhao, Xiaozheng Zheng, Peihao Zhu, Jiaxin Zou, Feilong Zuo</p>

            <p><strong>Title:</strong><br>
            Seedance 1.0: Exploring the Boundaries of Video Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09113v1">http://arxiv.org/abs/2506.09113v1</a></p>

            <p><strong>Abstract:</strong><br>
            Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precise and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with a proposed training paradigm, which natively supports multi-shot generation and joint learning of both text-to-video and image-to-video tasks; (iii) carefully optimized post-training approaches leveraging fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; and (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation, offering superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</title>
      <itunes:episode>905</itunes:episode>
      <podcast:episode>905</podcast:episode>
      <itunes:title>Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c08710c-797f-42e9-b5a4-d6518a08093b</guid>
      <link>https://share.transistor.fm/s/69b40cae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen</p>

            <p><strong>Title:</strong><br>
            Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09991v1">http://arxiv.org/abs/2506.09991v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 &amp; 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen</p>

            <p><strong>Title:</strong><br>
            Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09991v1">http://arxiv.org/abs/2506.09991v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 &amp; 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:42:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69b40cae/ccb865c7.mp3" length="20467987" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, Beidi Chen</p>

            <p><strong>Title:</strong><br>
            Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09991v1">http://arxiv.org/abs/2506.09991v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 &amp; 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation</title>
      <itunes:episode>904</itunes:episode>
      <podcast:episode>904</podcast:episode>
      <itunes:title>Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">70d48505-dd13-421d-aa7d-26d87fd62373</guid>
      <link>https://share.transistor.fm/s/2cceb065</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09350v1">http://arxiv.org/abs/2506.09350v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09350v1">http://arxiv.org/abs/2506.09350v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:42:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2cceb065/78774f51.mp3" length="25059682" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1563</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09350v1">http://arxiv.org/abs/2506.09350v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to transform a pre-trained latent video diffusion model into a real-time, interactive video generator. Our model autoregressively generates a latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as controls to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This not only allows us to design an architecture that is more efficient for one-step generation while fully utilizing the KV cache, but also enables training the model in a student-forcing manner that proves to be effective in reducing error accumulation during long video generation. Our experiments demonstrate that our 8B model achieves real-time, 24fps, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100 up to a minute long (1440 frames). Visit our research website at https://seaweed-apt.com/2</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ComfyUI-R1: Exploring Reasoning Models for Workflow Generation</title>
      <itunes:episode>903</itunes:episode>
      <podcast:episode>903</podcast:episode>
      <itunes:title>ComfyUI-R1: Exploring Reasoning Models for Workflow Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dc4bdc10-80e9-4f0b-b480-4d4da34726c4</guid>
      <link>https://share.transistor.fm/s/a55a0843</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.CV, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-R1: Exploring Reasoning Models for Workflow Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09790v1">http://arxiv.org/abs/2506.09790v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and strong node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.CV, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-R1: Exploring Reasoning Models for Workflow Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09790v1">http://arxiv.org/abs/2506.09790v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and strong node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:41:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a55a0843/2bdd8026.mp3" length="22176160" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL, cs.CV, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-R1: Exploring Reasoning Models for Workflow Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09790v1">http://arxiv.org/abs/2506.09790v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with a high pass rate and strong node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PlayerOne: Egocentric World Simulator</title>
      <itunes:episode>902</itunes:episode>
      <podcast:episode>902</podcast:episode>
      <itunes:title>PlayerOne: Egocentric World Simulator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe1d5fa9-68a9-434a-a7be-b1ccc109ba6f</guid>
      <link>https://share.transistor.fm/s/9f7b5cbe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            PlayerOne: Egocentric World Simulator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09995v1">http://arxiv.org/abs/2506.09995v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            PlayerOne: Egocentric World Simulator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09995v1">http://arxiv.org/abs/2506.09995v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:41:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9f7b5cbe/70d1c8cd.mp3" length="19420948" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1210</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Xiang Bai, Fan Wang, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            PlayerOne: Egocentric World Simulator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09995v1">http://arxiv.org/abs/2506.09995v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user, captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</title>
      <itunes:episode>901</itunes:episode>
      <podcast:episode>901</podcast:episode>
      <itunes:title>Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ae8a1eb-ce36-4648-a282-b1b0a6b67d24</guid>
      <link>https://share.transistor.fm/s/6b799f49</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SD, cs.AI, cs.LG, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Or Tal, Felix Kreuk, Yossi Adi</p>

            <p><strong>Title:</strong><br>
            Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08570v2">http://arxiv.org/abs/2506.08570v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SD, cs.AI, cs.LG, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Or Tal, Felix Kreuk, Yossi Adi</p>

            <p><strong>Title:</strong><br>
            Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08570v2">http://arxiv.org/abs/2506.08570v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Jun 2025 03:40:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b799f49/b22569d3.mp3" length="20820761" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.SD, cs.AI, cs.LG, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Or Tal, Felix Kreuk, Yossi Adi</p>

            <p><strong>Title:</strong><br>
            Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08570v2">http://arxiv.org/abs/2506.08570v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation. Sampled audio examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models</title>
      <itunes:episode>900</itunes:episode>
      <podcast:episode>900</podcast:episode>
      <itunes:title>Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cb39ff63-9491-4718-87ca-f2d4a1f20aa3</guid>
      <link>https://share.transistor.fm/s/f64a6a79</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina</p>

            <p><strong>Title:</strong><br>
            Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06751v1">http://arxiv.org/abs/2506.06751v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models' sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina</p>

            <p><strong>Title:</strong><br>
            Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06751v1">http://arxiv.org/abs/2506.06751v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models' sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 23:34:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f64a6a79/91e743d0.mp3" length="20671559" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina</p>

            <p><strong>Title:</strong><br>
            Geopolitical biases in LLMs: what are the "good" and the "bad" countries according to contemporary language models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06751v1">http://arxiv.org/abs/2506.06751v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper evaluates geopolitical biases in LLMs with respect to various countries through an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models' sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better</title>
      <itunes:episode>899</itunes:episode>
      <podcast:episode>899</podcast:episode>
      <itunes:title>Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d41eafab-b82b-4a3d-be9e-1c3e99c9ac86</guid>
      <link>https://share.transistor.fm/s/d1cf507f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09040v1">http://arxiv.org/abs/2506.09040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09040v1">http://arxiv.org/abs/2506.09040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 23:34:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1cf507f/d72abde9.mp3" length="20157848" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1256</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.09040v1">http://arxiv.org/abs/2506.09040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling</title>
      <itunes:episode>898</itunes:episode>
      <podcast:episode>898</podcast:episode>
      <itunes:title>RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">970fc3c4-1912-4b79-9b1d-1b19d8e46461</guid>
      <link>https://share.transistor.fm/s/e9172357</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Jiaqi Li, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08672v1">http://arxiv.org/abs/2506.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Jiaqi Li, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08672v1">http://arxiv.org/abs/2506.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 23:34:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9172357/1356660d.mp3" length="19335727" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1205</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Jiaqi Li, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08672v1">http://arxiv.org/abs/2506.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reinforcement Pre-Training</title>
      <itunes:episode>897</itunes:episode>
      <podcast:episode>897</podcast:episode>
      <itunes:title>Reinforcement Pre-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2de28679-d566-4aa2-827a-5d7b32154fbe</guid>
      <link>https://share.transistor.fm/s/b11f8bb8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Reinforcement Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08007v1">http://arxiv.org/abs/2506.08007v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Reinforcement Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08007v1">http://arxiv.org/abs/2506.08007v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:53:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b11f8bb8/8abaeb76.mp3" length="19525427" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1217</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 150 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Reinforcement Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.08007v1">http://arxiv.org/abs/2506.08007v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance</title>
      <itunes:episode>896</itunes:episode>
      <podcast:episode>896</podcast:episode>
      <itunes:title>Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">245ca8c1-e407-468c-a8b4-0906670d565b</guid>
      <link>https://share.transistor.fm/s/a0c107ad</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong</p>

            <p><strong>Title:</strong><br>
            Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06444v1">http://arxiv.org/abs/2506.06444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration--efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key--value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron , and our project homepage is at https://q-rz.github.io/p/saffron .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong</p>

            <p><strong>Title:</strong><br>
            Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06444v1">http://arxiv.org/abs/2506.06444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration--efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key--value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron , and our project homepage is at https://q-rz.github.io/p/saffron .</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:52:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a0c107ad/522aa134.mp3" length="20516456" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong</p>

            <p><strong>Title:</strong><br>
            Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06444v1">http://arxiv.org/abs/2506.06444v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods' susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration--efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key--value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at https://github.com/q-rz/saffron , and our project homepage is at https://q-rz.github.io/p/saffron .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiniCPM4: Ultra-Efficient LLMs on End Devices</title>
      <itunes:episode>895</itunes:episode>
      <podcast:episode>895</podcast:episode>
      <itunes:title>MiniCPM4: Ultra-Efficient LLMs on End Devices</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2de9d7d2-f3b5-4e6f-9f59-a05301f6a596</guid>
      <link>https://share.transistor.fm/s/5dc3c091</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM4: Ultra-Efficient LLMs on End Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07900v1">http://arxiv.org/abs/2506.07900v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM4: Ultra-Efficient LLMs on End Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07900v1">http://arxiv.org/abs/2506.07900v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:52:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5dc3c091/57a1cb9a.mp3" length="19465678" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1213</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Yukun Yan, Jiarui Yuan, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Ge Zhou, Jie Zhou, Wei Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            MiniCPM4: Ultra-Efficient LLMs on End Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07900v1">http://arxiv.org/abs/2506.07900v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SpatialLM: Training Large Language Models for Structured Indoor Modeling</title>
      <itunes:episode>894</itunes:episode>
      <podcast:episode>894</podcast:episode>
      <itunes:title>SpatialLM: Training Large Language Models for Structured Indoor Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">04143e0a-8a8e-4fda-9a8c-08f2950ceb5e</guid>
      <link>https://share.transistor.fm/s/41e83294</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou</p>

            <p><strong>Title:</strong><br>
            SpatialLM: Training Large Language Models for Structured Indoor Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07491v1">http://arxiv.org/abs/2506.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs.   To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou</p>

            <p><strong>Title:</strong><br>
            SpatialLM: Training Large Language Models for Structured Indoor Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07491v1">http://arxiv.org/abs/2506.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs.   To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:52:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/41e83294/88347f45.mp3" length="20868376" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou</p>

            <p><strong>Title:</strong><br>
            SpatialLM: Training Large Language Models for Structured Indoor Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07491v1">http://arxiv.org/abs/2506.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs.   To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Image Reconstruction as a Tool for Feature Analysis</title>
      <itunes:episode>893</itunes:episode>
      <podcast:episode>893</podcast:episode>
      <itunes:title>Image Reconstruction as a Tool for Feature Analysis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3798f04-f7f7-4309-98ee-392946698301</guid>
      <link>https://share.transistor.fm/s/5e64e753</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, 68T10, 68T30, 68T45, I.2.10</p>

            <p><strong>Authors:</strong><br>
            Eduard Allakhverdov, Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            Image Reconstruction as a Tool for Feature Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07803v1">http://arxiv.org/abs/2506.07803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available in GitHub.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, 68T10, 68T30, 68T45, I.2.10</p>

            <p><strong>Authors:</strong><br>
            Eduard Allakhverdov, Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            Image Reconstruction as a Tool for Feature Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07803v1">http://arxiv.org/abs/2506.07803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available in GitHub.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:51:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5e64e753/e99409a2.mp3" length="21640326" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1349</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, 68T10, 68T30, 68T45, I.2.10</p>

            <p><strong>Authors:</strong><br>
            Eduard Allakhverdov, Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            Image Reconstruction as a Tool for Feature Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.07803v1">http://arxiv.org/abs/2506.07803v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available in GitHub.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning</title>
      <itunes:episode>892</itunes:episode>
      <podcast:episode>892</podcast:episode>
      <itunes:title>Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8bba728b-274c-4378-9fc7-e6d7d1d85371</guid>
      <link>https://share.transistor.fm/s/1db50ed0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lingxian Mi, Sha Sa, Hongxiang Shu, Lei Tian, Chengzhi Wang, Jiayu Wang, Kaijie Wang, Qingyi Wang, Renwen Wang, Tao Wang, Wei Wang, Xirui Wang, Chao Wei, Xuguang Wei, Zijun Xia, Zhaohao Xiao, Tingshuai Yan, Liyan Yang, Yifan Yang, Zhikai Yang, Zhong Yin, Li Yuan, Liuchun Yuan, Chi Zhang, Jinyang Zhang, Junhui Zhang, Linge Zhang, Zhenyi Zhang, Zheyu Zhang, Dongjie Zhu, Hang Li, Yangang Zhang</p>

            <p><strong>Title:</strong><br>
            Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06205v1">http://arxiv.org/abs/2506.06205v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves a high end-to-end mission success rate across diverse indoor environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lingxian Mi, Sha Sa, Hongxiang Shu, Lei Tian, Chengzhi Wang, Jiayu Wang, Kaijie Wang, Qingyi Wang, Renwen Wang, Tao Wang, Wei Wang, Xirui Wang, Chao Wei, Xuguang Wei, Zijun Xia, Zhaohao Xiao, Tingshuai Yan, Liyan Yang, Yifan Yang, Zhikai Yang, Zhong Yin, Li Yuan, Liuchun Yuan, Chi Zhang, Jinyang Zhang, Junhui Zhang, Linge Zhang, Zhenyi Zhang, Zheyu Zhang, Dongjie Zhu, Hang Li, Yangang Zhang</p>

            <p><strong>Title:</strong><br>
            Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06205v1">http://arxiv.org/abs/2506.06205v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves a high end-to-end mission success rate across diverse indoor environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Jun 2025 03:51:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1db50ed0/191f9bc9.mp3" length="21061063" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1313</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lingxian Mi, Sha Sa, Hongxiang Shu, Lei Tian, Chengzhi Wang, Jiayu Wang, Kaijie Wang, Qingyi Wang, Renwen Wang, Tao Wang, Wei Wang, Xirui Wang, Chao Wei, Xuguang Wei, Zijun Xia, Zhaohao Xiao, Tingshuai Yan, Liyan Yang, Yifan Yang, Zhikai Yang, Zhong Yin, Li Yuan, Liuchun Yuan, Chi Zhang, Jinyang Zhang, Junhui Zhang, Linge Zhang, Zhenyi Zhang, Zheyu Zhang, Dongjie Zhu, Hang Li, Yangang Zhang</p>

            <p><strong>Title:</strong><br>
            Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.06205v1">http://arxiv.org/abs/2506.06205v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves a high end-to-end mission success rate across diverse indoor environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA</title>
      <itunes:episode>891</itunes:episode>
      <podcast:episode>891</podcast:episode>
      <itunes:title>Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">210a6bb4-ec0e-4871-8ab1-65d279ee7741</guid>
      <link>https://share.transistor.fm/s/d36f756d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii</p>

            <p><strong>Title:</strong><br>
            Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21115v1">http://arxiv.org/abs/2505.21115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii</p>

            <p><strong>Title:</strong><br>
            Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21115v1">http://arxiv.org/abs/2505.21115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Jun 2025 20:23:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d36f756d/e2dbf856.mp3" length="20696627" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Nikolay Ivanov, Daria Galimzianova, Nikita Krayko, Mikhail Salnikov, Vasily Konovalov, Alexander Panchenko, Viktor Moskvoretskii</p>

            <p><strong>Title:</strong><br>
            Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21115v1">http://arxiv.org/abs/2505.21115v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion</title>
      <itunes:episode>890</itunes:episode>
      <podcast:episode>890</podcast:episode>
      <itunes:title>FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5a303269-0d3f-447e-ac2d-aac84877c197</guid>
      <link>https://share.transistor.fm/s/2a8daec4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01111v1">http://arxiv.org/abs/2506.01111v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in https://github.com/satsuki2486441738/FusionAudio.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01111v1">http://arxiv.org/abs/2506.01111v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in https://github.com/satsuki2486441738/FusionAudio.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Jun 2025 20:23:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2a8daec4/e6994e91.mp3" length="20258591" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1262</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.SD, cs.AI, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Shunian Chen, Xinyuan Xie, Zheshu Chen, Liyan Zhao, Owen Lee, Zhan Su, Qilin Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01111v1">http://arxiv.org/abs/2506.01111v1</a></p>

            <p><strong>Abstract:</strong><br>
            High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data can be found in https://github.com/satsuki2486441738/FusionAudio.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning</title>
      <itunes:episode>889</itunes:episode>
      <podcast:episode>889</podcast:episode>
      <itunes:title>MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2bc5503-a66c-4d18-8693-70a64c505a62</guid>
      <link>https://share.transistor.fm/s/08f006c8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang</p>

            <p><strong>Title:</strong><br>
            MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05523v1">http://arxiv.org/abs/2506.05523v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3 variants, which represented the strongest systems available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang</p>

            <p><strong>Title:</strong><br>
            MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05523v1">http://arxiv.org/abs/2506.05523v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3 variants, which represented the strongest systems available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Jun 2025 20:22:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/08f006c8/a20cac10.mp3" length="22520173" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang</p>

            <p><strong>Title:</strong><br>
            MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05523v1">http://arxiv.org/abs/2506.05523v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3 variants, which represented the strongest systems available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs</title>
      <itunes:episode>888</itunes:episode>
      <podcast:episode>888</podcast:episode>
      <itunes:title>Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">06cb0793-88a4-4e54-b5f9-73beb17ea5ee</guid>
      <link>https://share.transistor.fm/s/abdfa95c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay</p>

            <p><strong>Title:</strong><br>
            Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05629v1">http://arxiv.org/abs/2506.05629v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and demonstrate its improved zero-shot domain transfer capability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay</p>

            <p><strong>Title:</strong><br>
            Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05629v1">http://arxiv.org/abs/2506.05629v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and demonstrate its improved zero-shot domain transfer capability.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Jun 2025 20:22:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/abdfa95c/7e1015ab.mp3" length="19318579" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1204</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay</p>

            <p><strong>Title:</strong><br>
            Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05629v1">http://arxiv.org/abs/2506.05629v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and demonstrate its improved zero-shot domain transfer capability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training</title>
      <itunes:episode>887</itunes:episode>
      <podcast:episode>887</podcast:episode>
      <itunes:title>SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">864dfb91-a468-4061-901b-185ebcb3f80f</guid>
      <link>https://share.transistor.fm/s/3f80637b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05301v1">http://arxiv.org/abs/2506.05301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed in high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 achieves comparable or even better performance than existing VR approaches in a single step.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05301v1">http://arxiv.org/abs/2506.05301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed in high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 achieves comparable or even better performance than existing VR approaches in a single step.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:15:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3f80637b/305b7ba8.mp3" length="20931073" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1305</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05301v1">http://arxiv.org/abs/2506.05301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding the window inconsistency observed in high-resolution VR when using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 achieves comparable or even better performance than existing VR approaches in a single step.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development</title>
      <itunes:episode>886</itunes:episode>
      <podcast:episode>886</podcast:episode>
      <itunes:title>ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d51f5836-70fc-4434-bdaf-ffc55742e9d2</guid>
      <link>https://share.transistor.fm/s/a6bf573a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05010v1">http://arxiv.org/abs/2506.05010v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05010v1">http://arxiv.org/abs/2506.05010v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:15:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6bf573a/62020e04.mp3" length="19778761" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05010v1">http://arxiv.org/abs/2506.05010v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts</title>
      <itunes:episode>885</itunes:episode>
      <podcast:episode>885</podcast:episode>
      <itunes:title>Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e9f7c324-fe3c-46a0-ab49-b3bedb381938</guid>
      <link>https://share.transistor.fm/s/ab490315</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05229v1">http://arxiv.org/abs/2506.05229v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05229v1">http://arxiv.org/abs/2506.05229v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:14:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ab490315/c54db968.mp3" length="19277639" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1201</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05229v1">http://arxiv.org/abs/2506.05229v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for single long-context inputs without complex batching and pipelining techniques. Because the technique is purely a run-time computation reordering, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics</title>
      <itunes:episode>884</itunes:episode>
      <podcast:episode>884</podcast:episode>
      <itunes:title>RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">769202cd-de9e-42ca-9acc-414f216155ee</guid>
      <link>https://share.transistor.fm/s/827adc01</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04308v1">http://arxiv.org/abs/2506.04308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches are still unable to accurately understand complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04308v1">http://arxiv.org/abs/2506.04308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches are still unable to accurately understand complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:14:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/827adc01/911fd4a1.mp3" length="22846177" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04308v1">http://arxiv.org/abs/2506.04308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches are still unable to accurately understand complex 3D scenes and dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video World Models with Long-term Spatial Memory</title>
      <itunes:episode>883</itunes:episode>
      <podcast:episode>883</podcast:episode>
      <itunes:title>Video World Models with Long-term Spatial Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">577acde6-8dc6-42de-969b-5af050dfcd4a</guid>
      <link>https://share.transistor.fm/s/2e20d20e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            Video World Models with Long-term Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05284v1">http://arxiv.org/abs/2506.05284v1</a></p>

            <p><strong>Abstract:</strong><br>
            Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            Video World Models with Long-term Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05284v1">http://arxiv.org/abs/2506.05284v1</a></p>

            <p><strong>Abstract:</strong><br>
            Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:14:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e20d20e/a8523e8d.mp3" length="21365306" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1332</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            Video World Models with Long-term Spatial Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05284v1">http://arxiv.org/abs/2506.05284v1</a></p>

            <p><strong>Abstract:</strong><br>
            Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights</title>
      <itunes:episode>882</itunes:episode>
      <podcast:episode>882</podcast:episode>
      <itunes:title>Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b99e4bc8-157c-45cc-a385-8a5f7a0a7794</guid>
      <link>https://share.transistor.fm/s/254d4966</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, Adrien Maret, Charles Masson, Rafaël Maurin, Arturo Mena, Philippe Modard, Axel Moyal, Axel Nguyen Kerbel, Julien Revelle, Mats L. Richter, María Santos, Laurent Sifre, Maxime Theillard, Marc Thibault, Louis Thiry, Léo Tronchon, Nicolas Usunier, Tony Wu</p>

            <p><strong>Title:</strong><br>
            Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02865v1">http://arxiv.org/abs/2506.02865v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Surfer-H, a cost-efficient web agent that integrates Vision-Language Models (VLMs) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves state-of-the-art performance of 92.2% on WebVoyager, striking a Pareto-optimal balance between accuracy and cost-efficiency. To accelerate research advancement in agentic systems, we are open-sourcing both our WebClick evaluation dataset and the Holo1 model weights.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, Adrien Maret, Charles Masson, Rafaël Maurin, Arturo Mena, Philippe Modard, Axel Moyal, Axel Nguyen Kerbel, Julien Revelle, Mats L. Richter, María Santos, Laurent Sifre, Maxime Theillard, Marc Thibault, Louis Thiry, Léo Tronchon, Nicolas Usunier, Tony Wu</p>

            <p><strong>Title:</strong><br>
            Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02865v1">http://arxiv.org/abs/2506.02865v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Surfer-H, a cost-efficient web agent that integrates Vision-Language Models (VLMs) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves state-of-the-art performance of 92.2% on WebVoyager, striking a Pareto-optimal balance between accuracy and cost-efficiency. To accelerate research advancement in agentic systems, we are open-sourcing both our WebClick evaluation dataset and the Holo1 model weights.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:13:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/254d4966/5eb3e77f.mp3" length="24400129" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1521</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, Mickaël Chen, Alexandra D. Constantinou, Antoine d'Andigné, Hubert de La Jonquière, Aurélien Delfosse, Ludovic Denoyer, Alexis Deprez, Augustin Derupti, Michael Eickenberg, Mathïs Federico, Charles Kantor, Xavier Koegler, Yann Labbé, Matthew C. H. Lee, Erwan Le Jumeau de Kergaradec, Amir Mahla, Avshalom Manevich, Adrien Maret, Charles Masson, Rafaël Maurin, Arturo Mena, Philippe Modard, Axel Moyal, Axel Nguyen Kerbel, Julien Revelle, Mats L. Richter, María Santos, Laurent Sifre, Maxime Theillard, Marc Thibault, Louis Thiry, Léo Tronchon, Nicolas Usunier, Tony Wu</p>

            <p><strong>Title:</strong><br>
            Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02865v1">http://arxiv.org/abs/2506.02865v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Surfer-H, a cost-efficient web agent that integrates Vision-Language Models (VLMs) to perform user-defined tasks on the web. We pair it with Holo1, a new open-weight collection of VLMs specialized in web navigation and information extraction. Holo1 was trained on carefully curated data sources, including open-access web content, synthetic examples, and self-produced agentic data. Holo1 tops generalist User Interface (UI) benchmarks as well as our new web UI localization benchmark, WebClick. When powered by Holo1, Surfer-H achieves state-of-the-art performance of 92.2% on WebVoyager, striking a Pareto-optimal balance between accuracy and cost-efficiency. To accelerate research advancement in agentic systems, we are open-sourcing both our WebClick evaluation dataset and the Holo1 model weights.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models</title>
      <itunes:episode>881</itunes:episode>
      <podcast:episode>881</podcast:episode>
      <itunes:title>Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cac5d845-a112-46d0-917b-a6a19b90331c</guid>
      <link>https://share.transistor.fm/s/6d3e526a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05176v1">http://arxiv.org/abs/2506.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05176v1">http://arxiv.org/abs/2506.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:13:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6d3e526a/26463272.mp3" length="20368924" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1269</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05176v1">http://arxiv.org/abs/2506.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models</title>
      <itunes:episode>880</itunes:episode>
      <podcast:episode>880</podcast:episode>
      <itunes:title>VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a0ec29d4-a605-4a59-b37e-5ee80056c3c7</guid>
      <link>https://share.transistor.fm/s/91558edc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23656v1">http://arxiv.org/abs/2505.23656v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23656v1">http://arxiv.org/abs/2505.23656v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:12:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/91558edc/b0e9700f.mp3" length="22409002" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23656v1">http://arxiv.org/abs/2505.23656v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physics-plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvement on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text</title>
      <itunes:episode>879</itunes:episode>
      <podcast:episode>879</podcast:episode>
      <itunes:title>The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">221972ce-fdb2-44ac-899a-775d3159dc35</guid>
      <link>https://share.transistor.fm/s/aa0cd4bf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray</p>

            <p><strong>Title:</strong><br>
            The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05209v1">http://arxiv.org/abs/2506.05209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray</p>

            <p><strong>Title:</strong><br>
            The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05209v1">http://arxiv.org/abs/2506.05209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:12:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa0cd4bf/739914b3.mp3" length="17635050" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1099</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nikhil Kandpal, Brian Lester, Colin Raffel, Sebastian Majstorovic, Stella Biderman, Baber Abbasi, Luca Soldaini, Enrico Shippole, A. Feder Cooper, Aviya Skowron, John Kirchenbauer, Shayne Longpre, Lintang Sutawika, Alon Albalak, Zhenlin Xu, Guilherme Penedo, Loubna Ben Allal, Elie Bakouch, John David Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, Aaron Gokaslan, Tom Goldstein, Brian R. Bartoldson, Bhavya Kailkhura, Tyler Murray</p>

            <p><strong>Title:</strong><br>
            The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05209v1">http://arxiv.org/abs/2506.05209v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain performance competitive with LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos</title>
      <itunes:episode>878</itunes:episode>
      <podcast:episode>878</podcast:episode>
      <itunes:title>VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bfda76ae-e9f3-401d-86ac-c67b4c579ab0</guid>
      <link>https://share.transistor.fm/s/cca1cfd8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan</p>

            <p><strong>Title:</strong><br>
            VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05349v1">http://arxiv.org/abs/2506.05349v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, with annotation totaling over 920 man-hours. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan</p>

            <p><strong>Title:</strong><br>
            VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05349v1">http://arxiv.org/abs/2506.05349v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, with annotation totaling over 920 man-hours. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 06 Jun 2025 21:12:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cca1cfd8/02cbc2c5.mp3" length="19800505" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan</p>

            <p><strong>Title:</strong><br>
            VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.05349v1">http://arxiv.org/abs/2506.05349v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, with annotation totaling over 920 man-hours. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiMo-VL Technical Report</title>
      <itunes:episode>877</itunes:episode>
      <podcast:episode>877</podcast:episode>
      <itunes:title>MiMo-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d64f8130-1fcb-4ad2-b6ea-87c79981707a</guid>
      <link>https://share.transistor.fm/s/42172452</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia</p>

            <p><strong>Title:</strong><br>
            MiMo-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03569v1">http://arxiv.org/abs/2506.03569v1</a></p>

            <p><strong>Abstract:</strong><br>
            We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia</p>

            <p><strong>Title:</strong><br>
            MiMo-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03569v1">http://arxiv.org/abs/2506.03569v1</a></p>

            <p><strong>Abstract:</strong><br>
            We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:57:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/42172452/8698d465.mp3" length="18605079" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1159</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia</p>

            <p><strong>Title:</strong><br>
            MiMo-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03569v1">http://arxiv.org/abs/2506.03569v1</a></p>

            <p><strong>Abstract:</strong><br>
            We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning</title>
      <itunes:episode>876</itunes:episode>
      <podcast:episode>876</podcast:episode>
      <itunes:title>Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c9ca80a-1437-4163-91f6-08970857429d</guid>
      <link>https://share.transistor.fm/s/230278ca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04207v1">http://arxiv.org/abs/2506.04207v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04207v1">http://arxiv.org/abs/2506.04207v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:56:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/230278ca/9b69ce68.mp3" length="19406373" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1209</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04207v1">http://arxiv.org/abs/2506.04207v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment</title>
      <itunes:episode>875</itunes:episode>
      <podcast:episode>875</podcast:episode>
      <itunes:title>AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">52ea052b-b2bd-4b69-97c2-df01a3d28916</guid>
      <link>https://share.transistor.fm/s/9776d00d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04089v1">http://arxiv.org/abs/2506.04089v1</a></p>

            <p><strong>Abstract:</strong><br>
            As part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed, but they are difficult to compare because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04089v1">http://arxiv.org/abs/2506.04089v1</a></p>

            <p><strong>Abstract:</strong><br>
            As part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed, but they are difficult to compare because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:56:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9776d00d/03928655.mp3" length="20157830" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1256</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI, cs.CL, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04089v1">http://arxiv.org/abs/2506.04089v1</a></p>

            <p><strong>Abstract:</strong><br>
            As part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed, but they are difficult to compare because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark</title>
      <itunes:episode>874</itunes:episode>
      <podcast:episode>874</podcast:episode>
      <itunes:title>CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0a699000-f53d-4c21-ae80-24afee033005</guid>
      <link>https://share.transistor.fm/s/b51ab7af</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AR, cs.AI, cs.CL, cs.LG, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud</p>

            <p><strong>Title:</strong><br>
            CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16968v3">http://arxiv.org/abs/2505.16968v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA &lt;--&gt; HIP) and assembly-level (Nvidia SASS &lt;--&gt; AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AR, cs.AI, cs.CL, cs.LG, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud</p>

            <p><strong>Title:</strong><br>
            CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16968v3">http://arxiv.org/abs/2505.16968v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA &lt;--&gt; HIP) and assembly-level (Nvidia SASS &lt;--&gt; AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:56:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b51ab7af/4738382f.mp3" length="21952556" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AR, cs.AI, cs.CL, cs.LG, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud</p>

            <p><strong>Title:</strong><br>
            CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16968v3">http://arxiv.org/abs/2505.16968v3</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA &lt;--&gt; HIP) and assembly-level (Nvidia SASS &lt;--&gt; AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Controllable Examination for Long-Context Language Models</title>
      <itunes:episode>873</itunes:episode>
      <podcast:episode>873</podcast:episode>
      <itunes:title>A Controllable Examination for Long-Context Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be8f7224-b198-4c65-a79b-71d44a859332</guid>
      <link>https://share.transistor.fm/s/590b7b4e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov</p>

            <p><strong>Title:</strong><br>
            A Controllable Examination for Long-Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02921v1">http://arxiv.org/abs/2506.02921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: <em>seamless context</em>, <em>controllable setting</em>, and <em>sound evaluation</em>. This study introduces <strong>LongBioBench</strong>, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of <em>understanding</em>, <em>reasoning</em>, and <em>trustworthiness</em>. Our experimental evaluation, which includes 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, render them inadequate tests of models' long-context capabilities. Moreover, we reveal that long-context continual pretraining primarily adjusts the RoPE embedding to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov</p>

            <p><strong>Title:</strong><br>
            A Controllable Examination for Long-Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02921v1">http://arxiv.org/abs/2506.02921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: <em>seamless context</em>, <em>controllable setting</em>, and <em>sound evaluation</em>. This study introduces <strong>LongBioBench</strong>, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of <em>understanding</em>, <em>reasoning</em>, and <em>trustworthiness</em>. Our experimental evaluation, which includes <strong>18</strong> LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, make them unreliable tests of models' long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embeddings to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.</p>
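
            <p>To give a concrete sense of the "controlled environment" idea, here is a small hypothetical sketch of how a coherent biography haystack with one target fact and a matching question could be assembled; the names, fields, and templates are illustrative and are not LongBioBench's actual generation code or schema.</p>

            <pre><code>import itertools
import random

# Hypothetical sketch of a controllable biography haystack: every sentence is a
# coherent biography, one of which carries the fact the question asks about.
FIRST = ["Ada", "Noor", "Kenji", "Lucia", "Tomas"]
LAST = ["Okafor", "Virtanen", "Marquez", "Ishida", "Kowalski"]
CITIES = ["Lisbon", "Osaka", "Nairobi", "Tallinn", "Quito"]
JOBS = ["botanist", "archivist", "luthier", "cartographer"]

def build_example(num_bios=20, seed=0):
    rng = random.Random(seed)
    # Sample unique full names so every question has a single unambiguous answer.
    names = rng.sample([f"{f} {l}" for f, l in itertools.product(FIRST, LAST)], num_bios)
    bios = [
        {
            "name": name,
            "city": rng.choice(CITIES),
            "job": rng.choice(JOBS),
            "birth_year": rng.randint(1950, 2000),
        }
        for name in names
    ]
    target = rng.choice(bios)
    context = " ".join(
        f"{b['name']} was born in {b['birth_year']} and works as a {b['job']} in {b['city']}."
        for b in bios
    )
    question = f"In which city does {target['name']} work?"
    return {"context": context, "question": question, "answer": target["city"]}

example = build_example()
print(example["question"], "->", example["answer"])
</code></pre>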
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:55:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/590b7b4e/de790498.mp3" length="20848719" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov</p>

            <p><strong>Title:</strong><br>
            A Controllable Examination for Long-Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02921v1">http://arxiv.org/abs/2506.02921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: <em>seamless context</em>, <em>controllable setting</em>, and <em>sound evaluation</em>. This study introduces <strong>LongBioBench</strong>, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across dimensions of <em>understanding</em>, <em>reasoning</em>, and <em>trustworthiness</em>. Our experimental evaluation, which includes <strong>18</strong> LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results and are less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, make them unreliable tests of models' long-context capabilities. Moreover, we also reveal that long-context continual pretraining primarily adjusts RoPE embeddings to accommodate extended context lengths. To sum up, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and is highly interpretable and configurable.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos</title>
      <itunes:episode>872</itunes:episode>
      <podcast:episode>872</podcast:episode>
      <itunes:title>MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eec8242c-333b-4f91-aeff-3ebe78ff1ad0</guid>
      <link>https://share.transistor.fm/s/8e31e3d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04141v1">http://arxiv.org/abs/2506.04141v1</a></p>

            <p><strong>Abstract:</strong><br>
            The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multi-modal reasoning differs from that used in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04141v1">http://arxiv.org/abs/2506.04141v1</a></p>

            <p><strong>Abstract:</strong><br>
            The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multi-modal reasoning differs from that used in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.</p>
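
            <p>Because the tasks pair each question with carefully designed distractors, the headline metric is plain multiple-choice accuracy; the generic scoring loop below illustrates that setup (the field names and the model stub are assumptions for the sketch, not the benchmark's actual harness).</p>

            <pre><code># Generic multiple-choice scoring loop for a video-reasoning benchmark. The field
# names ("video", "question", "options", "answer") and the answer_question stub
# are illustrative assumptions, not MMR-V's actual evaluation harness.
def evaluate(tasks, answer_question):
    tasks = list(tasks)
    correct = sum(
        answer_question(t["video"], t["question"], t["options"]) == t["answer"]
        for t in tasks
    )
    return correct / max(len(tasks), 1)

# Toy usage with a stub model that always picks the first option.
toy_tasks = [{
    "video": "clip_001.mp4",
    "question": "Why does the character smile at the end?",
    "options": ["A. relief", "B. sarcasm", "C. embarrassment", "D. habit"],
    "answer": "A. relief",
}]
print(evaluate(toy_tasks, lambda video, question, options: options[0]))
</code></pre>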
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:55:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e31e3d7/bb9a4d71.mp3" length="22400621" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04141v1">http://arxiv.org/abs/2506.04141v1</a></p>

            <p><strong>Abstract:</strong><br>
            The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multi-modal reasoning differs from that used in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis</title>
      <itunes:episode>871</itunes:episode>
      <podcast:episode>871</podcast:episode>
      <itunes:title>Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b5cad2f7-3971-471a-9a21-16d0de50eb69</guid>
      <link>https://share.transistor.fm/s/b2fa5000</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04142v1">http://arxiv.org/abs/2506.04142v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (ρ) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04142v1">http://arxiv.org/abs/2506.04142v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (ρ) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation</p>
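
            <p>For reference, the agreement statistic quoted above is the ordinary Spearman rank correlation between two sets of per-model scores, which standard tooling computes directly; the numbers in the sketch below are made-up placeholders, not results from the paper.</p>

            <pre><code># Spearman rank correlation between two per-model score lists, the statistic used
# to compare the patched evaluation against MixEval. Scores are made-up placeholders.
from scipy.stats import spearmanr

patched_eval_scores = [62.1, 55.4, 71.3, 48.0, 66.7, 59.2]
mixeval_scores      = [0.63, 0.60, 0.74, 0.46, 0.69, 0.57]

rho, p_value = spearmanr(patched_eval_scores, mixeval_scores)
print(f"Spearman rho = {rho:.3f}")   # about 0.943 for these placeholder scores
</code></pre>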
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:54:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2fa5000/753dd2f9.mp3" length="19526305" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1217</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao</p>

            <p><strong>Title:</strong><br>
            Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04142v1">http://arxiv.org/abs/2506.04142v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (ρ) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models</title>
      <itunes:episode>870</itunes:episode>
      <podcast:episode>870</podcast:episode>
      <itunes:title>SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">15255265-f195-436b-a0e6-44e418097bc8</guid>
      <link>https://share.transistor.fm/s/fcee054b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee</p>

            <p><strong>Title:</strong><br>
            SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04180v1">http://arxiv.org/abs/2506.04180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee</p>

            <p><strong>Title:</strong><br>
            SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04180v1">http://arxiv.org/abs/2506.04180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.</p>
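
            <p>The hierarchical DPO procedure builds on the standard Direct Preference Optimization objective; as a point of reference, a minimal sketch of that base objective is shown below. It is not the authors' hierarchical, MCTS-guided variant, and the beta value is an illustrative hyperparameter.</p>

            <pre><code>import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective over per-example sequence log-probabilities.

    Each argument is a tensor of summed token log-probs under the trainable
    policy or the frozen reference model; beta controls the implicit KL strength.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss.item())
</code></pre>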
            ]]>
      </content:encoded>
      <pubDate>Thu, 05 Jun 2025 20:54:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fcee054b/c2eab857.mp3" length="21236604" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1324</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-Wei Lee</p>

            <p><strong>Title:</strong><br>
            SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.04180v1">http://arxiv.org/abs/2506.04180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-form text generation remains a significant challenge for large language models (LLMs), particularly in maintaining coherence, ensuring logical consistency, and preserving text quality as sequence length increases. To address these limitations, we propose SuperWriter-Agent, an agent-based framework designed to enhance the quality and consistency of long-form text generation. SuperWriter-Agent introduces explicit structured thinking through planning and refinement stages into the generation pipeline, guiding the model to follow a more deliberate and cognitively grounded process akin to that of a professional writer. Based on this framework, we construct a supervised fine-tuning dataset to train a 7B SuperWriter-LM. We further develop a hierarchical Direct Preference Optimization (DPO) procedure that uses Monte Carlo Tree Search (MCTS) to propagate final quality assessments and optimize each generation step accordingly. Empirical results across diverse benchmarks demonstrate that SuperWriter-LM achieves state-of-the-art performance, surpassing even larger-scale baseline models in both automatic evaluation and human evaluation. Furthermore, comprehensive ablation studies demonstrate the effectiveness of hierarchical DPO and underscore the value of incorporating structured thinking steps to improve the quality of long-form text generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning</title>
      <itunes:episode>869</itunes:episode>
      <podcast:episode>869</podcast:episode>
      <itunes:title>Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">00a75c05-f6cd-40c4-81fb-0087608f5f62</guid>
      <link>https://share.transistor.fm/s/88ee765a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24726v1">http://arxiv.org/abs/2505.24726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24726v1">http://arxiv.org/abs/2505.24726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.</p>
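
            <p>The two-stage procedure reduces to a short control loop: attempt, verify, reflect on failure, retry with the reflection in context, and reward the reflection tokens only when the retry succeeds. A hypothetical outline follows; the model call, the binary verifier, and the reward bookkeeping are stubs rather than the authors' training code.</p>

            <pre><code># Hypothetical outline of the reflect-retry-reward control flow described above.
# generate, is_correct, and record_reward are stubs standing in for a language
# model call, a binary task verifier, and the RL reward bookkeeping, respectively.
def reflect_retry_reward(task, generate, is_correct, record_reward):
    first_attempt = generate(prompt=task)
    if is_correct(task, first_attempt):
        return first_attempt                      # solved on the first try, no reflection

    # Stage 1: on failure, ask the model to critique its own attempt.
    reflection = generate(
        prompt=f"{task}\n\nPrevious attempt:\n{first_attempt}\n\n"
               "Reflect on what went wrong before trying again."
    )

    # Stage 2: retry the task with the self-reflection kept in context.
    second_attempt = generate(prompt=f"{task}\n\nReflection:\n{reflection}")

    # Only the reflection tokens are rewarded, and only if the retry succeeds.
    if is_correct(task, second_attempt):
        record_reward(tokens=reflection, reward=1.0)
    return second_attempt
</code></pre>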
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:05:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/88ee765a/45fad0f7.mp3" length="22042004" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 144 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24726v1">http://arxiv.org/abs/2505.24726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments</title>
      <itunes:episode>868</itunes:episode>
      <podcast:episode>868</podcast:episode>
      <itunes:title>VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81600822-181b-4931-81b7-79063240f7fc</guid>
      <link>https://share.transistor.fm/s/69ef5559</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02387v1">http://arxiv.org/abs/2506.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02387v1">http://arxiv.org/abs/2506.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.</p>
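
            <p>The two evaluation dimensions reduce to two simple aggregates; the sketch below shows how next-action prediction accuracy and a normalized episode return might be computed over logged records (the field names and normalization bounds are assumptions for the sketch, not VS-Bench's exact protocol).</p>

            <pre><code># Illustrative aggregation for the two evaluation dimensions: offline next-action
# prediction accuracy and online normalized episode return. Field names and the
# normalization bounds are assumptions for the sketch, not VS-Bench's exact spec.
def prediction_accuracy(records):
    """records: iterable of dicts with 'predicted_action' and 'actual_action' keys."""
    records = list(records)
    correct = sum(r["predicted_action"] == r["actual_action"] for r in records)
    return correct / max(len(records), 1)

def normalized_return(episode_return, min_return, max_return):
    """Map a raw episode return into [0, 1] given environment-specific bounds."""
    span = max(max_return - min_return, 1e-8)
    return (episode_return - min_return) / span

records = [
    {"predicted_action": "cooperate", "actual_action": "cooperate"},
    {"predicted_action": "defect", "actual_action": "cooperate"},
]
print(prediction_accuracy(records), normalized_return(7.5, 0.0, 10.0))
</code></pre>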
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:05:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69ef5559/4f495600.mp3" length="23459748" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1463</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang</p>

            <p><strong>Title:</strong><br>
            VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02387v1">http://arxiv.org/abs/2506.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses on multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation</title>
      <itunes:episode>867</itunes:episode>
      <podcast:episode>867</podcast:episode>
      <itunes:title>UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37ca408e-0754-4e14-8d5c-0117ae255983</guid>
      <link>https://share.transistor.fm/s/c7d1e160</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03147v2">http://arxiv.org/abs/2506.03147v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training samples, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03147v2">http://arxiv.org/abs/2506.03147v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training samples, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:05:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c7d1e160/a195af11.mp3" length="18471399" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1151</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan</p>

            <p><strong>Title:</strong><br>
            UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03147v2">http://arxiv.org/abs/2506.03147v2</a></p>

            <p><strong>Abstract:</strong><br>
            Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training samples, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis</title>
      <itunes:episode>866</itunes:episode>
      <podcast:episode>866</podcast:episode>
      <itunes:title>SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">86e87b31-8f05-4f34-b20f-3a1a87c8ab08</guid>
      <link>https://share.transistor.fm/s/e06bbdb4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02096v1">http://arxiv.org/abs/2506.02096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose <strong>SynthRL</strong>, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with an appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02096v1">http://arxiv.org/abs/2506.02096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose <strong>SynthRL</strong>, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with an appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.</p>
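
            <p>The three stages can be pictured as a simple select-augment-verify loop; the skeleton below is a rough, hypothetical rendering of that flow, with the augmenter and verifier left as stubs (it is not SynthRL's actual implementation).</p>

            <pre><code># Rough skeleton of a seed -> augment -> verify data-scaling loop in the spirit
# of the three stages described above. augment and verify_answer are stubs
# standing in for an LLM-based rewriter and the guaranteed verification stage.
def synthesize(seed_questions, augment, verify_answer, max_tries=3):
    synthesized = []
    for item in seed_questions:                        # stage 1: selected seed questions
        for _ in range(max_tries):
            harder = augment(item["question"])         # stage 2: harder variant, same answer
            if verify_answer(harder, item["answer"]):  # stage 3: keep only verified items
                synthesized.append({"question": harder, "answer": item["answer"]})
                break
    return synthesized
</code></pre>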
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:04:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e06bbdb4/9d8977b0.mp3" length="17620825" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1098</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh</p>

            <p><strong>Title:</strong><br>
            SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02096v1">http://arxiv.org/abs/2506.02096v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) trained via reinforcement learning with verifiable reward (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose <strong>SynthRL</strong>, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with an appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and difficulty enhancement. Our empirical experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, with a significant improvement over baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs</title>
      <itunes:episode>865</itunes:episode>
      <podcast:episode>865</podcast:episode>
      <itunes:title>CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">430effd5-c83c-4034-910e-102be66f1b64</guid>
      <link>https://share.transistor.fm/s/34db89ce</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song</p>

            <p><strong>Title:</strong><br>
            CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24120v1">http://arxiv.org/abs/2505.24120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song</p>

            <p><strong>Title:</strong><br>
            CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24120v1">http://arxiv.org/abs/2505.24120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:04:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/34db89ce/1ec54b64.mp3" length="20274890" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1263</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song</p>

            <p><strong>Title:</strong><br>
            CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24120v1">http://arxiv.org/abs/2505.24120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning. We additionally propose a rigorous evaluation protocol to systematically assess whether model predictions are substantiated by valid intermediate reasoning steps based on curated explanations. Our comprehensive evaluation of 15 VLMs on this benchmark reveals notable performance disparities, as even the top-ranked proprietary model attains only 49.6% accuracy. This empirical evidence underscores the pressing need for advancing scientific reasoning capabilities in VLMs. Our CSVQA is released at https://huggingface.co/datasets/Skywork/CSVQA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</title>
      <itunes:episode>864</itunes:episode>
      <podcast:episode>864</podcast:episode>
      <itunes:title>GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">568f1856-b508-40c5-b057-0d59225c673f</guid>
      <link>https://share.transistor.fm/s/84c08e4b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03143v1">http://arxiv.org/abs/2506.03143v1</a></p>

            <p><strong>Abstract:</strong><br>
            One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.</p>
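
            <p>A minimal PyTorch sketch of the attention-style action head idea: one learned query token scores every visual patch token and yields a distribution over candidate regions. The dimensions, class name, and scoring function are illustrative assumptions, not the paper's implementation:</p>

<pre><code># Sketch of an attention-based action head over visual patch tokens.
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.action_query = nn.Parameter(torch.randn(dim))  # dedicated action token
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from the vision backbone
        q = self.q_proj(self.action_query)                       # (dim,)
        k = self.k_proj(patch_tokens)                            # (B, P, dim)
        scores = torch.einsum("d,bpd->bp", q, k) / k.shape[-1] ** 0.5
        return scores.softmax(dim=-1)   # attention over patches = action region map

head = AttentionActionHead()
probs = head(torch.randn(2, 196, 256))   # e.g. a 14x14 patch grid
print(probs.shape, probs.sum(dim=-1))    # (2, 196); each row sums to 1
</code></pre>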
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03143v1">http://arxiv.org/abs/2506.03143v1</a></p>

            <p><strong>Abstract:</strong><br>
            One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:03:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/84c08e4b/17accda4.mp3" length="21552561" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, Si Qin, Lars Liden, Qingwei Lin, Huan Zhang, Tong Zhang, Jianbing Zhang, Dongmei Zhang, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.03143v1">http://arxiv.org/abs/2506.03143v1</a></p>

            <p><strong>Abstract:</strong><br>
            One of the principal challenges in building VLM-powered GUI agents is visual grounding, i.e., localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment, inability to handle ambiguous supervision targets, and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution. Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones. Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for a 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces</title>
      <itunes:episode>863</itunes:episode>
      <podcast:episode>863</podcast:episode>
      <itunes:title>Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b6d1943f-d774-415e-ade7-0bd288f16e67</guid>
      <link>https://share.transistor.fm/s/235ab0fd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu</p>

            <p><strong>Title:</strong><br>
            Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00123v1">http://arxiv.org/abs/2506.00123v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs into motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains of +5.6% on MMVet, but also excels in legged robot tasks with +50% average gains.</p>
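
            <p>A hypothetical sketch of the text-to-control idea: the MLLM emits its action as text over 2D image coordinates, and a small adapter turns it into a low-level command. The JSON format and field names are invented here for illustration only:</p>

<pre><code># Illustrative adapter from text-formatted 2D actions to a motion command.
import json
from dataclasses import dataclass

@dataclass
class MotionCommand:
    target_xy: tuple        # normalized 2D point the policy should reach
    gripper_closed: bool

def adapt(mllm_output: str) -> MotionCommand:
    # Assume the model answers with a small JSON snippet such as
    # {"point": [0.42, 0.73], "grasp": true}
    action = json.loads(mllm_output)
    return MotionCommand(tuple(action["point"]), bool(action["grasp"]))

print(adapt('{"point": [0.42, 0.73], "grasp": true}'))
</code></pre>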
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu</p>

            <p><strong>Title:</strong><br>
            Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00123v1">http://arxiv.org/abs/2506.00123v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs into motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains of +5.6% on MMVet, but also excels in legged robot tasks with +50% average gains.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:03:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/235ab0fd/a3ca2619.mp3" length="21453122" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu</p>

            <p><strong>Title:</strong><br>
            Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00123v1">http://arxiv.org/abs/2506.00123v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs to not only grasp multimodal understanding abilities, but also integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control into common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. Then, a novel robotic adapter is proposed to convert textual control signals from MLLMs into motion policies of real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. For VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopt multimodal chain-of-thought (CoT) to mix the different capabilities into a single conversation. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs like Qwen2.5-VL. When deployed to legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capabilities compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves substantial gains of +5.6% on MMVet, but also excels in legged robot tasks with +50% average gains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation</title>
      <itunes:episode>862</itunes:episode>
      <podcast:episode>862</podcast:episode>
      <itunes:title>OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a70997f5-abca-46c4-8b7d-28d6ff5b9b05</guid>
      <link>https://share.transistor.fm/s/7d5cfe44</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02397v1">http://arxiv.org/abs/2506.02397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical issue: a substantial portion of simple tasks solved by LRMs can also be addressed by non-reasoning LLMs using significantly fewer tokens, indicating that complex reasoning may not always be necessary. To address this, we systematically analyze the reasoning trajectories of LRMs and present a method that uses identified paradigms and an LLM judge to classify these trajectories as either Redundant Reasoning or Essential Reasoning. We then introduce OThink-R1, a method that prunes redundant reasoning steps while preserving logical validity. OThink-R1 dynamically employs the non-thinking mode (fast-thinking) for straightforward problems while engaging in deliberate thinking (slow-thinking) for complex problems. Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23% on average without compromising accuracy, offering practical guidelines for efficient reasoning models. The code is available at https://github.com/AgenticIR-Lab/OThink-R1.</p>
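
            <p>A toy Python sketch of the fast/slow routing idea: a judge labels a reasoning trajectory as redundant or essential, and simple inputs are answered in non-thinking mode. The length-based judge and the lambda "models" are placeholders, not the method itself:</p>

<pre><code># Sketch of routing between fast (non-thinking) and slow (thinking) modes.
def judge_trajectory(question: str, trajectory: str) -> str:
    # Placeholder for an LLM judge; here, short traces count as redundant.
    return "redundant" if len(trajectory.split()) < 30 else "essential"

def answer(question: str, trajectory: str, fast_model, slow_model) -> str:
    if judge_trajectory(question, trajectory) == "redundant":
        return fast_model(question)          # direct answer, few tokens
    return slow_model(question)              # full chain-of-thought answer

fast = lambda q: "42"
slow = lambda q: "Let me think step by step... 42"
print(answer("What is 6 x 7?", "6 times 7 is 42.", fast, slow))
</code></pre>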
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02397v1">http://arxiv.org/abs/2506.02397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical issue: a substantial portion of simple tasks solved by LRMs can also be addressed by non-reasoning LLMs using significantly fewer tokens, indicating that complex reasoning may not always be necessary. To address this, we systematically analyze the reasoning trajectories of LRMs and present a method that uses identified paradigms and an LLM judge to classify these trajectories as either Redundant Reasoning or Essential Reasoning. We then introduce OThink-R1, a method that prunes redundant reasoning steps while preserving logical validity. OThink-R1 dynamically employs the non-thinking mode (fast-thinking) for straightforward problems while engaging in deliberate thinking (slow-thinking) for complex problems. Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23% on average without compromising accuracy, offering practical guidelines for efficient reasoning models. The code is available at https://github.com/AgenticIR-Lab/OThink-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 04 Jun 2025 21:03:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d5cfe44/610964f2.mp3" length="23552104" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, Jun Wang</p>

            <p><strong>Title:</strong><br>
            OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.02397v1">http://arxiv.org/abs/2506.02397v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advanced large reasoning models (LRMs) leverage extended chain-of-thought (CoT) reasoning to solve complex tasks, achieving state-of-the-art performance. Despite their success, we identify a critical issue: a substantial portion of simple tasks solved by LRMs can also be addressed by non-reasoning LLMs using significantly fewer tokens, indicating that complex reasoning may not always be necessary. To address this, we systematically analyze the reasoning trajectories of LRMs and present a method that uses identified paradigms and an LLM judge to classify these trajectories as either Redundant Reasoning or Essential Reasoning. We then introduce OThink-R1, a method that prunes redundant reasoning steps while preserving logical validity. OThink-R1 dynamically employs the non-thinking mode (fast-thinking) for straightforward problems while engaging in deliberate thinking (slow-thinking) for complex problems. Experiments across mathematical and question-answering tasks demonstrate that OThink-R1 reduces reasoning redundancy by almost 23% on average without compromising accuracy, offering practical guidelines for efficient reasoning models. The code is available at https://github.com/AgenticIR-Lab/OThink-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</title>
      <itunes:episode>861</itunes:episode>
      <podcast:episode>861</podcast:episode>
      <itunes:title>Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1b10ec62-ada9-41a0-9b41-b98cf5f13afa</guid>
      <link>https://share.transistor.fm/s/715a065a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01939v1">http://arxiv.org/abs/2506.01939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), although its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding that goes even beyond the 80/20 rule: updating only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.</p>
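
            <p>A short PyTorch sketch of restricting policy-gradient updates to high-entropy tokens: compute per-token entropy, keep the top 20%, and mask the loss elsewhere. The 20% ratio follows the abstract; the shapes and advantage handling are illustrative assumptions:</p>

<pre><code># Sketch: policy-gradient loss masked to high-entropy ("forking") tokens.
import torch
import torch.nn.functional as F

def forking_token_loss(logits, actions, advantages, keep_ratio=0.2):
    # logits: (B, T, V), actions: (B, T), advantages: (B, T)
    logp = F.log_softmax(logits, dim=-1)
    probs = logp.exp()
    entropy = -(probs * logp).sum(-1)                      # per-token entropy (B, T)
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = entropy.flatten().topk(k).values.min()
    mask = (entropy >= threshold).float()                  # 1 only for forking tokens
    chosen_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return -(chosen_logp * advantages * mask).sum() / mask.sum().clamp(min=1)

loss = forking_token_loss(torch.randn(2, 8, 50),
                          torch.randint(0, 50, (2, 8)),
                          torch.randn(2, 8))
print(loss)
</code></pre>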
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01939v1">http://arxiv.org/abs/2506.01939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), although its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding that goes even beyond the 80/20 rule: updating only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:17:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/715a065a/53fe9e7b.mp3" length="21302254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 99 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01939v1">http://arxiv.org/abs/2506.01939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), although its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding that goes even beyond the 80/20 rule: updating only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards</title>
      <itunes:episode>860</itunes:episode>
      <podcast:episode>860</podcast:episode>
      <itunes:title>REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8de99ee-fe97-4cfa-a98c-a89ef51f1889</guid>
      <link>https://share.transistor.fm/s/47a63132</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf</p>

            <p><strong>Title:</strong><br>
            REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24760v1">http://arxiv.org/abs/2505.24760v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains, including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both the evaluation and the reinforcement learning of reasoning models.</p>
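
            <p>An illustrative generator/verifier pair in the same spirit: procedurally generated problems with adjustable difficulty and exact-match verification. This is a sketch of the concept, not the library's actual API:</p>

<pre><code># Sketch: procedural problem generator plus verifiable reward.
import random

def generate(difficulty: int, rng: random.Random) -> dict:
    terms = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    expression = " + ".join(map(str, terms))
    return {"question": f"What is {expression}?", "answer": str(sum(terms))}

def verify(item: dict, model_answer: str) -> float:
    # Exact-match check gives a verifiable 0/1 reward.
    return 1.0 if model_answer.strip() == item["answer"] else 0.0

rng = random.Random(0)
item = generate(difficulty=2, rng=rng)
print(item, verify(item, item["answer"]))
</code></pre>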
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf</p>

            <p><strong>Title:</strong><br>
            REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24760v1">http://arxiv.org/abs/2505.24760v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains, including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both the evaluation and the reinforcement learning of reasoning models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:17:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/47a63132/fe291e4b.mp3" length="20821581" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf</p>

            <p><strong>Title:</strong><br>
            REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24760v1">http://arxiv.org/abs/2505.24760v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains, including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both the evaluation and the reinforcement learning of reasoning models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics</title>
      <itunes:episode>859</itunes:episode>
      <podcast:episode>859</podcast:episode>
      <itunes:title>SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c34bc591-6cad-44e5-b4d9-2dac47085170</guid>
      <link>https://share.transistor.fm/s/ba43fc0e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene</p>

            <p><strong>Title:</strong><br>
            SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01844v1">http://arxiv.org/abs/2506.01844v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.</p>
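
            <p>A toy Python sketch of asynchronous, chunked inference: a background thread keeps predicting the next action chunk while the main loop executes the current one, so execution does not block on model latency. The fake predictor, sleep-based latency, and chunk size are assumptions for illustration:</p>

<pre><code># Sketch: decoupling action prediction from action execution with a chunk queue.
import queue
import threading
import time

def predictor(action_queue: queue.Queue, stop: threading.Event):
    step = 0
    while not stop.is_set():
        chunk = [f"action_{step + i}" for i in range(4)]   # pretend VLA forward pass
        time.sleep(0.05)                                   # simulated model latency
        action_queue.put(chunk)                            # blocks if queue is full
        step += 4

actions: queue.Queue = queue.Queue(maxsize=2)
stop = threading.Event()
threading.Thread(target=predictor, args=(actions, stop), daemon=True).start()

for _ in range(3):                   # execution loop consumes chunks as they arrive
    for action in actions.get():
        print("execute", action)
stop.set()
</code></pre>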
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene</p>

            <p><strong>Title:</strong><br>
            SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01844v1">http://arxiv.org/abs/2506.01844v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:17:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ba43fc0e/ae36fc15.mp3" length="20316257" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1266</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene</p>

            <p><strong>Title:</strong><br>
            SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01844v1">http://arxiv.org/abs/2506.01844v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Taming LLMs by Scaling Learning Rates with Gradient Grouping</title>
      <itunes:episode>858</itunes:episode>
      <podcast:episode>858</podcast:episode>
      <itunes:title>Taming LLMs by Scaling Learning Rates with Gradient Grouping</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">910c6c8f-8cf3-442d-aa47-191e23b899ce</guid>
      <link>https://share.transistor.fm/s/93e6ec4a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Taming LLMs by Scaling Learning Rates with Gradient Grouping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01049v1">http://arxiv.org/abs/2506.01049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers and offers consistent gains and faster convergence over baselines across various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.</p>
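
            <p>A toy PyTorch sketch of the grouping-and-scaling idea: within each layer, parameters are split into two clusters by gradient magnitude and each cluster's step is rescaled toward the layer mean, imposing a group-wise constraint before the base optimizer runs. The two-cluster split and scaling rule are simplifications for illustration, not the proposed optimizer:</p>

<pre><code># Sketch: group gradients per layer and apply cluster-specific scaling.
import torch

def group_scale_gradients(model: torch.nn.Module, strength: float = 0.5):
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs()
        layer_mean = g.mean()
        hi = g > g.median()                      # two clusters per layer
        for cluster in (hi, ~hi):
            if cluster.any():
                cluster_mean = g[cluster].mean().clamp(min=1e-12)
                scale = (layer_mean / cluster_mean) ** strength
                p.grad[cluster] *= scale         # group-specific scaling

model = torch.nn.Linear(8, 4)
torch.nn.functional.mse_loss(model(torch.randn(16, 8)), torch.randn(16, 4)).backward()
group_scale_gradients(model)                     # rescale before the optimizer step
torch.optim.AdamW(model.parameters(), lr=1e-3).step()
</code></pre>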
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Taming LLMs by Scaling Learning Rates with Gradient Grouping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01049v1">http://arxiv.org/abs/2506.01049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers and offers consistent gains and faster convergence over baselines across various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:16:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/93e6ec4a/b8c17b97.mp3" length="19579378" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1220</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Juanxi Tian, Zedong Wang, Xin Jin, Zicheng Liu, Wentao Zhang, Dan Xu</p>

            <p><strong>Title:</strong><br>
            Taming LLMs by Scaling Learning Rates with Gradient Grouping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01049v1">http://arxiv.org/abs/2506.01049v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers and offers consistent gains and faster convergence over baselines across various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ARIA: Training Language Agents with Intention-Driven Reward Aggregation</title>
      <itunes:episode>857</itunes:episode>
      <podcast:episode>857</podcast:episode>
      <itunes:title>ARIA: Training Language Agents with Intention-Driven Reward Aggregation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">60bb386d-02cf-4b1c-9f5b-0cb7e67f90a8</guid>
      <link>https://share.transistor.fm/s/d97b0d22</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARIA: Training Language Agents with Intention-Driven Reward Aggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00539v1">http://arxiv.org/abs/2506.00539v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, resulting in an exponentially large action space. Sampling actions in such a space can lead to extreme reward sparsity, which introduces large reward variance and hinders effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective language Agent training. ARIA projects natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains, averaging 9.95% across four downstream tasks and consistently outperforming offline and online RL baselines.</p>
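
            <p>A small numpy/scikit-learn sketch of intention-space reward aggregation: embed sampled actions, cluster them, and replace each action's sparse reward with its cluster mean, which densifies the signal and lowers variance. The random embeddings stand in for a real text encoder, and the cluster count is an arbitrary assumption:</p>

<pre><code># Sketch: cluster action embeddings and share rewards within each cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
action_embeddings = rng.normal(size=(32, 16))           # stand-in for a text encoder
rewards = rng.binomial(1, 0.1, size=32).astype(float)   # sparse 0/1 rewards

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(action_embeddings)
aggregated = np.array([rewards[labels == c].mean() for c in labels])

print("reward variance before:", rewards.var(), "after:", aggregated.var())
</code></pre>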
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARIA: Training Language Agents with Intention-Driven Reward Aggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00539v1">http://arxiv.org/abs/2506.00539v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, making it exponentially large. Sampling actions in such a space can lead to extreme reward sparsity, which introduces high reward variance and hinders effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective training of language Agents. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains averaging 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:16:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d97b0d22/f98722c6.mp3" length="22858279" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1425</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruihan Yang, Yikai Zhang, Aili Chen, Xintao Wang, Siyu Yuan, Jiangjie Chen, Deqing Yang, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARIA: Training Language Agents with Intention-Driven Reward Aggregation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00539v1">http://arxiv.org/abs/2506.00539v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have enabled agents to perform complex reasoning and decision-making through free-form language interactions. However, in open-ended language action environments (e.g., negotiation or question-asking games), the action space can be formulated as a joint distribution over tokens, making it exponentially large. Sampling actions in such a space can lead to extreme reward sparsity, which introduces high reward variance and hinders effective reinforcement learning (RL). To address this, we propose ARIA, a method that Aggregates Rewards in Intention space to enable efficient and effective training of language Agents. ARIA aims to project natural language actions from the high-dimensional joint token distribution space into a low-dimensional intention space, where semantically similar actions are clustered and assigned shared rewards. This intention-aware reward aggregation reduces reward variance by densifying reward signals, fostering better policy optimization. Extensive experiments demonstrate that ARIA not only significantly reduces policy gradient variance, but also delivers substantial performance gains averaging 9.95% across four downstream tasks, consistently outperforming offline and online RL baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models</title>
      <itunes:episode>856</itunes:episode>
      <podcast:episode>856</podcast:episode>
      <itunes:title>Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">990e5bb0-4666-4749-b2fe-0fbbfc876443</guid>
      <link>https://share.transistor.fm/s/b97ee02b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kinam Kim, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00996v1">http://arxiv.org/abs/2506.00996v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kinam Kim, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00996v1">http://arxiv.org/abs/2506.00996v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:15:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b97ee02b/262449c8.mp3" length="19309395" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1203</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kinam Kim, Junha Hyung, Jaegul Choo</p>

            <p><strong>Title:</strong><br>
            Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00996v1">http://arxiv.org/abs/2506.00996v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks</title>
      <itunes:episode>855</itunes:episode>
      <podcast:episode>855</podcast:episode>
      <itunes:title>LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cb8e6c7a-7e40-4df9-aa7b-a9e2d7312b1c</guid>
      <link>https://share.transistor.fm/s/7221b991</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00411v1">http://arxiv.org/abs/2506.00411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA embraces a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator, containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00411v1">http://arxiv.org/abs/2506.00411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA embraces a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator, containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:15:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7221b991/03333457.mp3" length="18723835" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1167</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yi Yang, Jiaxuan Sun, Siqi Kou, Yihan Wang, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.00411v1">http://arxiv.org/abs/2506.00411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Real-world embodied agents face long-horizon tasks, characterized by high-level goals demanding multi-step solutions beyond single actions. Successfully navigating these requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision language action (VLA) models and hierarchical architectures offer potential in embodied tasks, the former often falter in planning, and the latter can suffer from coordination issues, both hampering performance. We introduce a new unified VLA framework for long-horizon tasks, dubbed LoHoVLA, to overcome these limitations. LoHoVLA leverages a large pretrained vision language model (VLM) as the backbone to jointly generate language and action tokens for sub-task generation and robot action prediction, respectively. This shared representation promotes better generalization across tasks. Additionally, LoHoVLA embraces a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator, containing 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles</title>
      <itunes:episode>854</itunes:episode>
      <podcast:episode>854</podcast:episode>
      <itunes:title>Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">328c0a3f-6df6-4398-b6ed-e8aa367a45f3</guid>
      <link>https://share.transistor.fm/s/a574cb7d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko</p>

            <p><strong>Title:</strong><br>
            Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23590v2">http://arxiv.org/abs/2505.23590v2</a></p>

            <p><strong>Abstract:</strong><br>
            The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth and adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: <em>Firstly,</em> we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. <em>Secondly,</em> training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. <em>Thirdly,</em> MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. <em>Fourthly,</em> we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. <em>Finally,</em> our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable jigsaw piece to the larger puzzle of our collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko</p>

            <p><strong>Title:</strong><br>
            Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23590v2">http://arxiv.org/abs/2505.23590v2</a></p>

            <p><strong>Abstract:</strong><br>
            The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth and adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: <em>Firstly,</em> we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. <em>Secondly,</em> training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. <em>Thirdly,</em> MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. <em>Fourthly,</em> we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. <em>Finally,</em> our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable jigsaw piece to the larger puzzle of our collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:15:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a574cb7d/87d5b595.mp3" length="24065774" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1500</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko</p>

            <p><strong>Title:</strong><br>
            Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23590v2">http://arxiv.org/abs/2505.23590v2</a></p>

            <p><strong>Abstract:</strong><br>
            The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth and adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: <em>Firstly,</em> we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. <em>Secondly,</em> training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. <em>Thirdly,</em> MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. <em>Fourthly,</em> we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. <em>Finally,</em> our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable jigsaw piece to the larger puzzle of our collective understanding of rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding</title>
      <itunes:episode>853</itunes:episode>
      <podcast:episode>853</podcast:episode>
      <itunes:title>ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">49130cdf-8541-4780-915e-eb59a1c1fa79</guid>
      <link>https://share.transistor.fm/s/3eb083c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01853v1">http://arxiv.org/abs/2506.01853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, ChatGPT-4o's multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01853v1">http://arxiv.org/abs/2506.01853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, ChatGPT-4o's multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:14:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3eb083c6/fc387b08.mp3" length="21269201" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01853v1">http://arxiv.org/abs/2506.01853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, ChatGPT-4o's multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: https://github.com/JAMESYJL/ShapeLLM-Omni</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning</title>
      <itunes:episode>852</itunes:episode>
      <podcast:episode>852</podcast:episode>
      <itunes:title>SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bb779104-bcbf-454d-aaef-ae180ce213fc</guid>
      <link>https://share.transistor.fm/s/09320907</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan</p>

            <p><strong>Title:</strong><br>
            SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01713v1">http://arxiv.org/abs/2506.01713v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan</p>

            <p><strong>Title:</strong><br>
            SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01713v1">http://arxiv.org/abs/2506.01713v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 03 Jun 2025 21:14:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/09320907/b0400af9.mp3" length="19887438" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1239</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan</p>

            <p><strong>Title:</strong><br>
            SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2506.01713v1">http://arxiv.org/abs/2506.01713v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</title>
      <itunes:episode>851</itunes:episode>
      <podcast:episode>851</podcast:episode>
      <itunes:title>ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">326faa71-4d81-4818-a72b-9e1c4835291b</guid>
      <link>https://share.transistor.fm/s/c0a40ba0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong</p>

            <p><strong>Title:</strong><br>
            ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24864v1">http://arxiv.org/abs/2505.24864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong</p>

            <p><strong>Title:</strong><br>
            ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24864v1">http://arxiv.org/abs/2505.24864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:08:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c0a40ba0/072c6cce.mp3" length="20663597" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong</p>

            <p><strong>Title:</strong><br>
            ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24864v1">http://arxiv.org/abs/2505.24864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time</title>
      <itunes:episode>850</itunes:episode>
      <podcast:episode>850</podcast:episode>
      <itunes:title>AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4ab87b02-442e-47ea-b86a-44ffe7d85022</guid>
      <link>https://share.transistor.fm/s/bcf1e65b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24863v1">http://arxiv.org/abs/2505.24863v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces the α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24863v1">http://arxiv.org/abs/2505.24863v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces the α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:08:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bcf1e65b/7e762c8d.mp3" length="20164105" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24863v1">http://arxiv.org/abs/2505.24863v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces the α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Time Blindness: Why Video-Language Models Can't See What Humans Can?</title>
      <itunes:episode>849</itunes:episode>
      <podcast:episode>849</podcast:episode>
      <itunes:title>Time Blindness: Why Video-Language Models Can't See What Humans Can?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e912b2e-6380-4511-b1bd-83305b883ea4</guid>
      <link>https://share.transistor.fm/s/bb8689a7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny</p>

            <p><strong>Title:</strong><br>
            Time Blindness: Why Video-Language Models Can't See What Humans Can?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24867v1">http://arxiv.org/abs/2505.24867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce <strong>SpookyBench</strong>, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when models are trained on datasets with low spatial signal-to-noise ratios (SNR), their temporal understanding degrades more rapidly than human perception does, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny</p>

            <p><strong>Title:</strong><br>
            Time Blindness: Why Video-Language Models Can't See What Humans Can?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24867v1">http://arxiv.org/abs/2505.24867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:07:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bb8689a7/02beb082.mp3" length="21632402" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny</p>

            <p><strong>Title:</strong><br>
            Time Blindness: Why Video-Language Models Can't See What Humans Can?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24867v1">http://arxiv.org/abs/2505.24867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HardTests: Synthesizing High-Quality Test Cases for LLM Coding</title>
      <itunes:episode>848</itunes:episode>
      <podcast:episode>848</podcast:episode>
      <itunes:title>HardTests: Synthesizing High-Quality Test Cases for LLM Coding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">474b2b69-dbc4-4943-9fb2-530431d9fa88</guid>
      <link>https://share.transistor.fm/s/cbdc89ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li</p>

            <p><strong>Title:</strong><br>
            HardTests: Synthesizing High-Quality Test Cases for LLM Coding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24098v1">http://arxiv.org/abs/2505.24098v1</a></p>

            <p><strong>Abstract:</strong><br>
            Verifiers play a crucial role in large language model (LLM) reasoning and are needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully constructed, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li</p>

            <p><strong>Title:</strong><br>
            HardTests: Synthesizing High-Quality Test Cases for LLM Coding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24098v1">http://arxiv.org/abs/2505.24098v1</a></p>

            <p><strong>Abstract:</strong><br>
            Verifiers play a crucial role in large language model (LLM) reasoning and are needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully constructed, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:07:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cbdc89ff/03181745.mp3" length="20863351" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li</p>

            <p><strong>Title:</strong><br>
            HardTests: Synthesizing High-Quality Test Cases for LLM Coding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24098v1">http://arxiv.org/abs/2505.24098v1</a></p>

            <p><strong>Abstract:</strong><br>
            Verifiers play a crucial role in large language model (LLM) reasoning and are needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully constructed, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Language Models for Data Synthesis</title>
      <itunes:episode>847</itunes:episode>
      <podcast:episode>847</podcast:episode>
      <itunes:title>Large Language Models for Data Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3703b065-eb4c-4efd-ae4e-ad654b223e20</guid>
      <link>https://share.transistor.fm/s/4f137abc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihong Tang, Menglin Kong, Lijun Sun</p>

            <p><strong>Title:</strong><br>
            Large Language Models for Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14752v1">http://arxiv.org/abs/2505.14752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihong Tang, Menglin Kong, Lijun Sun</p>

            <p><strong>Title:</strong><br>
            Large Language Models for Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14752v1">http://arxiv.org/abs/2505.14752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:07:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4f137abc/46c34d2f.mp3" length="21698829" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihong Tang, Menglin Kong, Lijun Sun</p>

            <p><strong>Title:</strong><br>
            Large Language Models for Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14752v1">http://arxiv.org/abs/2505.14752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation</title>
      <itunes:episode>846</itunes:episode>
      <podcast:episode>846</podcast:episode>
      <itunes:title>Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">879b2572-2cc4-4b98-a18c-ea64736f384d</guid>
      <link>https://share.transistor.fm/s/73c25ec1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18842v1">http://arxiv.org/abs/2505.18842v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18842v1">http://arxiv.org/abs/2505.18842v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:06:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73c25ec1/ca53483d.mp3" length="21234951" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1323</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18842v1">http://arxiv.org/abs/2505.18842v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present v1, a lightweight extension to Multimodal Large Language Models (MLLMs) that enables selective visual revisitation during inference. While current MLLMs typically consume visual input only once and reason purely over internal memory, v1 introduces a simple point-and-copy mechanism that allows the model to dynamically retrieve relevant image regions throughout the reasoning process. This mechanism augments existing architectures with minimal modifications, enabling contextual access to visual tokens based on the model's evolving hypotheses. To train this capability, we construct v1g, a dataset of 300K multimodal reasoning traces with interleaved visual grounding annotations. Experiments on three multimodal mathematical reasoning benchmarks -- MathVista, MathVision, and MathVerse -- demonstrate that v1 consistently improves performance over comparable baselines, particularly on tasks requiring fine-grained visual reference and multi-step reasoning. Our results suggest that dynamic visual access is a promising direction for enhancing grounded multimodal reasoning. Code, models, and data will be released to support future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ViStoryBench: Comprehensive Benchmark Suite for Story Visualization</title>
      <itunes:episode>845</itunes:episode>
      <podcast:episode>845</podcast:episode>
      <itunes:title>ViStoryBench: Comprehensive Benchmark Suite for Story Visualization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">988a162c-f45f-46f5-85b8-1180483c1dd0</guid>
      <link>https://share.transistor.fm/s/c81d41e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang</p>

            <p><strong>Title:</strong><br>
            ViStoryBench: Comprehensive Benchmark Suite for Story Visualization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24862v1">http://arxiv.org/abs/2505.24862v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang</p>

            <p><strong>Title:</strong><br>
            ViStoryBench: Comprehensive Benchmark Suite for Story Visualization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24862v1">http://arxiv.org/abs/2505.24862v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:06:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c81d41e4/0978d6f1.mp3" length="20132345" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Zhewei Huang, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Gang Yu, Chi Zhang</p>

            <p><strong>Title:</strong><br>
            ViStoryBench: Comprehensive Benchmark Suite for Story Visualization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24862v1">http://arxiv.org/abs/2505.24862v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, which aims to generate a sequence of visually coherent images aligning with a given narrative and reference images, has seen significant progress with recent advancements in generative models. To further enhance the performance of story visualization frameworks in real-world scenarios, we introduce a comprehensive evaluation benchmark, ViStoryBench. We collect a diverse dataset encompassing various story types and artistic styles, ensuring models are evaluated across multiple dimensions such as different plots (e.g., comedy, horror) and visual aesthetics (e.g., anime, 3D renderings). ViStoryBench is carefully curated to balance narrative structures and visual elements, featuring stories with single and multiple protagonists to test models' ability to maintain character consistency. Additionally, it includes complex plots and intricate world-building to challenge models in generating accurate visuals. To ensure comprehensive comparisons, our benchmark incorporates a wide range of evaluation metrics assessing critical aspects. This structured and multifaceted framework enables researchers to thoroughly identify both the strengths and weaknesses of different models, fostering targeted improvements.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models</title>
      <itunes:episode>844</itunes:episode>
      <podcast:episode>844</podcast:episode>
      <itunes:title>DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9d48da52-5b02-4d2e-815a-3800ea6679b8</guid>
      <link>https://share.transistor.fm/s/06938269</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren</p>

            <p><strong>Title:</strong><br>
            DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24025v1">http://arxiv.org/abs/2505.24025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose \textbf{DINO-R1}, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces \textbf{Group Relative Query Optimization (GRQO)}, a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL regularization to stabilize the objectness distribution and reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren</p>

            <p><strong>Title:</strong><br>
            DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24025v1">http://arxiv.org/abs/2505.24025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose \textbf{DINO-R1}, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces \textbf{Group Relative Query Optimization (GRQO)}, a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL regularization to stabilize the objectness distribution and reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 02 Jun 2025 21:06:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/06938269/e2be04c4.mp3" length="21897391" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenbin Pan, Wenbin He, Zhengzhong Tu, Liu Ren</p>

            <p><strong>Title:</strong><br>
            DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.24025v1">http://arxiv.org/abs/2505.24025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose \textbf{DINO-R1}, the first such attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces \textbf{Group Relative Query Optimization (GRQO)}, a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL regularization to stabilize the objectness distribution and reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Table-R1: Inference-Time Scaling for Table Reasoning</title>
      <itunes:episode>843</itunes:episode>
      <podcast:episode>843</podcast:episode>
      <itunes:title>Table-R1: Inference-Time Scaling for Table Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">131ec61a-017f-4f92-9232-e80e22b65eab</guid>
      <link>https://share.transistor.fm/s/54d561d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            Table-R1: Inference-Time Scaling for Table Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23621v1">http://arxiv.org/abs/2505.23621v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            Table-R1: Inference-Time Scaling for Table Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23621v1">http://arxiv.org/abs/2505.23621v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:46:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/54d561d6/158fb84e.mp3" length="20668572" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 66 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            Table-R1: Inference-Time Scaling for Table Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23621v1">http://arxiv.org/abs/2505.23621v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as the emergence of essential table reasoning skills during RL training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence</title>
      <itunes:episode>842</itunes:episode>
      <podcast:episode>842</podcast:episode>
      <itunes:title>Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">04623e3d-5c8f-4e53-9fd2-8edb811c488a</guid>
      <link>https://share.transistor.fm/s/de4b980a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV, cs.AI, cs.LG, I.2.6; I.2</p>

            <p><strong>Authors:</strong><br>
            Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23747v1">http://arxiv.org/abs/2505.23747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV, cs.AI, cs.LG, I.2.6; I.2</p>

            <p><strong>Authors:</strong><br>
            Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23747v1">http://arxiv.org/abs/2505.23747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:45:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de4b980a/3463e0ef.mp3" length="19164780" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1194</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CV, cs.AI, cs.LG, I.2.6; I.2</p>

            <p><strong>Authors:</strong><br>
            Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23747v1">http://arxiv.org/abs/2505.23747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that our Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos</title>
      <itunes:episode>841</itunes:episode>
      <podcast:episode>841</podcast:episode>
      <itunes:title>VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2eb61c43-d9e4-4b09-b945-a40357d49045</guid>
      <link>https://share.transistor.fm/s/87d10bda</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23693v1">http://arxiv.org/abs/2505.23693v1</a></p>

            <p><strong>Abstract:</strong><br>
            MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23693v1">http://arxiv.org/abs/2505.23693v1</a></p>

            <p><strong>Abstract:</strong><br>
            MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:45:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/87d10bda/d04298e6.mp3" length="24760832" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1544</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao</p>

            <p><strong>Title:</strong><br>
            VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23693v1">http://arxiv.org/abs/2505.23693v1</a></p>

            <p><strong>Abstract:</strong><br>
            MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason</title>
      <itunes:episode>840</itunes:episode>
      <podcast:episode>840</podcast:episode>
      <itunes:title>The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a5d4bf60-8406-43f5-9daf-c53acea74555</guid>
      <link>https://share.transistor.fm/s/2e939283</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22653v1">http://arxiv.org/abs/2505.22653v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by rewarding only the appearance of key reasoning phrases such as "first, I need to" (namely reasoning pattern reward, RPR), without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22653v1">http://arxiv.org/abs/2505.22653v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by rewarding only the appearance of key reasoning phrases such as "first, I need to" (namely reasoning pattern reward, RPR), without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:44:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2e939283/41fdeca9.mp3" length="21245811" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1324</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22653v1">http://arxiv.org/abs/2505.22653v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by rewarding only the appearance of key reasoning phrases such as "first, I need to" (namely reasoning pattern reward, RPR), without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ZeroGUI: Automating Online GUI Learning at Zero Human Cost</title>
      <itunes:episode>839</itunes:episode>
      <podcast:episode>839</podcast:episode>
      <itunes:title>ZeroGUI: Automating Online GUI Learning at Zero Human Cost</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a660149-7b2c-474e-bc5e-400f2c7d0525</guid>
      <link>https://share.transistor.fm/s/fa308c6e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            ZeroGUI: Automating Online GUI Learning at Zero Human Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23762v1">http://arxiv.org/abs/2505.23762v1</a></p>

            <p><strong>Abstract:</strong><br>
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUIs) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            ZeroGUI: Automating Online GUI Learning at Zero Human Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23762v1">http://arxiv.org/abs/2505.23762v1</a></p>

            <p><strong>Abstract:</strong><br>
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUIs) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:44:31 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa308c6e/4c4df62e.mp3" length="18305436" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1140</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            ZeroGUI: Automating Online GUI Learning at Zero Human Cost</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23762v1">http://arxiv.org/abs/2505.23762v1</a></p>

            <p><strong>Abstract:</strong><br>
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUIs) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?</title>
      <itunes:episode>838</itunes:episode>
      <podcast:episode>838</podcast:episode>
      <itunes:title>VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">13950093-71fb-4f0b-afc8-cdd3182e5d9c</guid>
      <link>https://share.transistor.fm/s/afe6cc31</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun</p>

            <p><strong>Title:</strong><br>
            VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23359v1">http://arxiv.org/abs/2505.23359v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to show the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun</p>

            <p><strong>Title:</strong><br>
            VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23359v1">http://arxiv.org/abs/2505.23359v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to show the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:44:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/afe6cc31/00c530c8.mp3" length="20724183" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanxin Liu, Kun Ouyang, Haoning Wu, Yi Liu, Lin Sui, Xinhao Li, Yan Zhong, Y. Charles, Xinyu Zhou, Xu Sun</p>

            <p><strong>Title:</strong><br>
            VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23359v1">http://arxiv.org/abs/2505.23359v1</a></p>

            <p><strong>Abstract:</strong><br>
Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to show the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning, e.g., GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering</title>
      <itunes:episode>837</itunes:episode>
      <podcast:episode>837</podcast:episode>
      <itunes:title>Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0b388cf9-dcd8-48df-99a6-1bedea570500</guid>
      <link>https://share.transistor.fm/s/88258794</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23604v1">http://arxiv.org/abs/2505.23604v1</a></p>

            <p><strong>Abstract:</strong><br>
Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when they have fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23604v1">http://arxiv.org/abs/2505.23604v1</a></p>

            <p><strong>Abstract:</strong><br>
Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when they have fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 30 May 2025 20:43:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/88258794/ab9c1bc2.mp3" length="20735059" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.23604v1">http://arxiv.org/abs/2505.23604v1</a></p>

            <p><strong>Abstract:</strong><br>
Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when they have fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models</title>
      <itunes:episode>836</itunes:episode>
      <podcast:episode>836</podcast:episode>
      <itunes:title>The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9420c78b-3fcc-4ab5-8adb-94c86d4a8872</guid>
      <link>https://share.transistor.fm/s/33ada166</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22617v1">http://arxiv.org/abs/2505.22617v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a wide range of RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded for policy entropy and is thus bottlenecked by its exhaustion, with a fully predictable ceiling of R = -a + b at H = 0. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to the advantage when using policy-gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. Based on this understanding of the mechanism behind entropy dynamics, we propose to control entropy by restricting the updates of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22617v1">http://arxiv.org/abs/2505.22617v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a wide range of RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded for policy entropy and is thus bottlenecked by its exhaustion, with a fully predictable ceiling of R = -a + b at H = 0. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to the advantage when using policy-gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. Based on this understanding of the mechanism behind entropy dynamics, we propose to control entropy by restricting the updates of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:12:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/33ada166/dfc26c9b.mp3" length="21313926" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22617v1">http://arxiv.org/abs/2505.22617v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a wide range of RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded for policy entropy and is thus bottlenecked by its exhaustion, with a fully predictable ceiling of R = -a + b at H = 0. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to the advantage when using policy-gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences matched exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. Based on this understanding of the mechanism behind entropy dynamics, we propose to control entropy by restricting the updates of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents</title>
      <itunes:episode>835</itunes:episode>
      <podcast:episode>835</podcast:episode>
      <itunes:title>SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c0665a5d-d3fe-477e-b127-9e7a540b848a</guid>
      <link>https://share.transistor.fm/s/d5c873e7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20411v1">http://arxiv.org/abs/2505.20411v1</a></p>

            <p><strong>Abstract:</strong><br>
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models may be inflated due to contamination issues.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20411v1">http://arxiv.org/abs/2505.20411v1</a></p>

            <p><strong>Abstract:</strong><br>
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models may be inflated due to contamination issues.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:12:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d5c873e7/73eec16f.mp3" length="20268230" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1263</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel</p>

            <p><strong>Title:</strong><br>
            SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20411v1">http://arxiv.org/abs/2505.20411v1</a></p>

            <p><strong>Abstract:</strong><br>
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models may be inflated due to contamination issues.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</title>
      <itunes:episode>834</itunes:episode>
      <podcast:episode>834</podcast:episode>
      <itunes:title>R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4d0d4b8-e2ac-4c68-a741-7657dd73fd3a</guid>
      <link>https://share.transistor.fm/s/b82b127b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL, cs.AI, cs.LG, cs.PF, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21600v1">http://arxiv.org/abs/2505.21600v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we reveal that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce <strong>Roads to Rome (R2R)</strong>, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family and evaluate it on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL, cs.AI, cs.LG, cs.PF, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21600v1">http://arxiv.org/abs/2505.21600v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we reveal that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce <strong>Roads to Rome (R2R)</strong>, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family and evaluate it on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:11:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b82b127b/6440f505.mp3" length="22156126" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1381</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CL, cs.AI, cs.LG, cs.PF, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang</p>

            <p><strong>Title:</strong><br>
            R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21600v1">http://arxiv.org/abs/2505.21600v1</a></p>

            <p><strong>Abstract:</strong><br>
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers because they fail to follow the LLMs' reasoning paths. Fortunately, we reveal that only a small fraction of tokens genuinely cause the reasoning paths of LLMs and SLMs to diverge. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce <strong>Roads to Rome (R2R)</strong>, a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine the R1-1.5B and R1-32B models from the DeepSeek family and evaluate it on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skywork Open Reasoner 1 Technical Report</title>
      <itunes:episode>833</itunes:episode>
      <podcast:episode>833</podcast:episode>
      <itunes:title>Skywork Open Reasoner 1 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1c5292c9-d9cd-491c-a55d-682ccec4142a</guid>
      <link>https://share.transistor.fm/s/48a298a7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork Open Reasoner 1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22312v2">http://arxiv.org/abs/2505.22312v2</a></p>

            <p><strong>Abstract:</strong><br>
            The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork Open Reasoner 1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22312v2">http://arxiv.org/abs/2505.22312v2</a></p>

            <p><strong>Abstract:</strong><br>
            The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:11:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/48a298a7/45ecc671.mp3" length="21513255" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1341</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork Open Reasoner 1 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22312v2">http://arxiv.org/abs/2505.22312v2</a></p>

            <p><strong>Abstract:</strong><br>
            The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sherlock: Self-Correcting Reasoning in Vision-Language Models</title>
      <itunes:episode>832</itunes:episode>
      <podcast:episode>832</podcast:episode>
      <itunes:title>Sherlock: Self-Correcting Reasoning in Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25d18245-d5ea-4fc4-8ae6-e790256ae46b</guid>
      <link>https://share.transistor.fm/s/bfd7c93d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Ding, Ruqi Zhang</p>

            <p><strong>Title:</strong><br>
            Sherlock: Self-Correcting Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22651v1">http://arxiv.org/abs/2505.22651v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Ding, Ruqi Zhang</p>

            <p><strong>Title:</strong><br>
            Sherlock: Self-Correcting Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22651v1">http://arxiv.org/abs/2505.22651v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:10:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bfd7c93d/e753d621.mp3" length="20590004" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Ding, Ruqi Zhang</p>

            <p><strong>Title:</strong><br>
            Sherlock: Self-Correcting Reasoning in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22651v1">http://arxiv.org/abs/2505.22651v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO</title>
      <itunes:episode>831</itunes:episode>
      <podcast:episode>831</podcast:episode>
      <itunes:title>Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">079455ce-52b5-403d-9f36-5f7496c56328</guid>
      <link>https://share.transistor.fm/s/545a50ce</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22453v1">http://arxiv.org/abs/2505.22453v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard datasets without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22453v1">http://arxiv.org/abs/2505.22453v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard datasets without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:10:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/545a50ce/2bcc8835.mp3" length="22034893" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1373</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22453v1">http://arxiv.org/abs/2505.22453v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard datasets without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SageAttention2++: A More Efficient Implementation of SageAttention2</title>
      <itunes:episode>830</itunes:episode>
      <podcast:episode>830</podcast:episode>
      <itunes:title>SageAttention2++: A More Efficient Implementation of SageAttention2</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">74a08df4-1fd6-418d-a9ab-25fd2bd2b681</guid>
      <link>https://share.transistor.fm/s/227ce55b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.AR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2++: A More Efficient Implementation of SageAttention2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21136v2">http://arxiv.org/abs/2505.21136v2</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster FP8 Matmul instruction that accumulates in FP16. This instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.AR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2++: A More Efficient Implementation of SageAttention2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21136v2">http://arxiv.org/abs/2505.21136v2</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster FP8 Matmul instruction that accumulates in FP16. This instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:10:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/227ce55b/2a6e62d1.mp3" length="18967075" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1182</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.AR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2++: A More Efficient Implementation of SageAttention2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21136v2">http://arxiv.org/abs/2505.21136v2</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster FP8 Matmul instruction that accumulates in FP16. This instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start</title>
      <itunes:episode>829</itunes:episode>
      <podcast:episode>829</podcast:episode>
      <itunes:title>Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2c5f972a-8162-4ecd-ba6e-6103784f3e53</guid>
      <link>https://share.transistor.fm/s/4bec2047</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22334v1">http://arxiv.org/abs/2505.22334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22334v1">http://arxiv.org/abs/2505.22334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:09:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4bec2047/9d09da60.mp3" length="20762216" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1294</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang</p>

            <p><strong>Title:</strong><br>
            Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22334v1">http://arxiv.org/abs/2505.22334v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fostering Video Reasoning via Next-Event Prediction</title>
      <itunes:episode>828</itunes:episode>
      <podcast:episode>828</podcast:episode>
      <itunes:title>Fostering Video Reasoning via Next-Event Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">08cd962a-14fc-4bc3-9c58-4d46ba647fc4</guid>
      <link>https://share.transistor.fm/s/ba0f70d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Fostering Video Reasoning via Next-Event Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22457v1">http://arxiv.org/abs/2505.22457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Fostering Video Reasoning via Next-Event Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22457v1">http://arxiv.org/abs/2505.22457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:09:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ba0f70d6/132e0aa9.mp3" length="23968776" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            Fostering Video Reasoning via Next-Event Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.22457v1">http://arxiv.org/abs/2505.22457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination</title>
      <itunes:episode>827</itunes:episode>
      <podcast:episode>827</podcast:episode>
      <itunes:title>RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">67d17f32-96fa-4acf-bc44-6536b185c4be</guid>
      <link>https://share.transistor.fm/s/63b30b55</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.GR, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong</p>

            <p><strong>Title:</strong><br>
            RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21925v1">http://arxiv.org/abs/2505.21925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.GR, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong</p>

            <p><strong>Title:</strong><br>
            RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21925v1">http://arxiv.org/abs/2505.21925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 29 May 2025 21:09:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/63b30b55/0d1bdc52.mp3" length="22467508" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1401</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.GR, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chong Zeng, Yue Dong, Pieter Peers, Hongzhi Wu, Xin Tong</p>

            <p><strong>Title:</strong><br>
            RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21925v1">http://arxiv.org/abs/2505.21925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle-sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows</title>
      <itunes:episode>826</itunes:episode>
      <podcast:episode>826</podcast:episode>
      <itunes:title>ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dd360820-8a20-4824-9416-1837e9976ae2</guid>
      <link>https://share.transistor.fm/s/1e0d1595</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19897v1">http://arxiv.org/abs/2505.19897v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19897v1">http://arxiv.org/abs/2505.19897v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:05:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1e0d1595/e33ca290.mp3" length="21397110" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 85 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19897v1">http://arxiv.org/abs/2505.19897v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs</title>
      <itunes:episode>825</itunes:episode>
      <podcast:episode>825</podcast:episode>
      <itunes:title>MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e89a560d-ff3a-44f7-819b-e83b367d2544</guid>
      <link>https://share.transistor.fm/s/1d0348aa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21327v1">http://arxiv.org/abs/2505.21327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21327v1">http://arxiv.org/abs/2505.21327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:05:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1d0348aa/aff6891b.mp3" length="20310400" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1266</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21327v1">http://arxiv.org/abs/2505.21327v1</a></p>

            <p><strong>Abstract:</strong><br>
            Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate the reasoning abilities of MLLMs due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers</title>
      <itunes:episode>824</itunes:episode>
      <podcast:episode>824</podcast:episode>
      <itunes:title>Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">40b4b332-4fa1-4537-9bf0-a44cf6462468</guid>
      <link>https://share.transistor.fm/s/d66d72fe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV, cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr</p>

            <p><strong>Title:</strong><br>
            Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21497v1">http://arxiv.org/abs/2505.21497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. PosterAgent transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV, cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr</p>

            <p><strong>Title:</strong><br>
            Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21497v1">http://arxiv.org/abs/2505.21497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. PosterAgent transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:04:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d66d72fe/dd4a0dd4.mp3" length="17252194" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1075</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.CV, cs.AI, cs.CL, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr</p>

            <p><strong>Title:</strong><br>
            Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21497v1">http://arxiv.org/abs/2505.21497v1</a></p>

            <p><strong>Abstract:</strong><br>
            Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters, (ii) Textual Coherence: language fluency, (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. PosterAgent transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data</title>
      <itunes:episode>823</itunes:episode>
      <podcast:episode>823</podcast:episode>
      <itunes:title>OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d1e611ad-1721-436b-9627-899ce950293c</guid>
      <link>https://share.transistor.fm/s/52fdc9b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Cheng Liu, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18445v1">http://arxiv.org/abs/2505.18445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Cheng Liu, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18445v1">http://arxiv.org/abs/2505.18445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:04:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52fdc9b3/7294a611.mp3" length="23488989" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Cheng Liu, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18445v1">http://arxiv.org/abs/2505.18445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have advanced image stylization significantly, yet two core challenges persist: (1) maintaining consistent stylization in complex scenes, particularly identity, composition, and fine details, and (2) preventing style degradation in image-to-image pipelines with style LoRAs. GPT-4o's exceptional stylization consistency highlights the performance gap between open-source methods and proprietary models. To bridge this gap, we propose OmniConsistency, a universal consistency plugin leveraging large-scale Diffusion Transformers (DiTs). OmniConsistency contributes: (1) an in-context consistency learning framework trained on aligned image pairs for robust generalization; (2) a two-stage progressive learning strategy decoupling style learning from consistency preservation to mitigate style degradation; and (3) a fully plug-and-play design compatible with arbitrary style LoRAs under the Flux framework. Extensive experiments show that OmniConsistency significantly enhances visual coherence and aesthetic quality, achieving performance comparable to the commercial state-of-the-art model GPT-4o.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation</title>
      <itunes:episode>822</itunes:episode>
      <podcast:episode>822</podcast:episode>
      <itunes:title>OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c9e773b-1caf-4612-95a9-fce0c0927011</guid>
      <link>https://share.transistor.fm/s/e603ed35</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan</p>

            <p><strong>Title:</strong><br>
            OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20292v3">http://arxiv.org/abs/2505.20292v3</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan</p>

            <p><strong>Title:</strong><br>
            OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20292v3">http://arxiv.org/abs/2505.20292v3</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:04:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e603ed35/61e73f60.mp3" length="19145987" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1193</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Jiebo Luo, Li Yuan</p>

            <p><strong>Title:</strong><br>
            OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20292v3">http://arxiv.org/abs/2505.20292v3</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in the production of videos. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench that focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different content. Moreover, we create the first open-source large-scale S2V generation dataset OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond</title>
      <itunes:episode>821</itunes:episode>
      <podcast:episode>821</podcast:episode>
      <itunes:title>SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e6868326-1369-4304-b36f-1f81115e6459</guid>
      <link>https://share.transistor.fm/s/d98297e8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He</p>

            <p><strong>Title:</strong><br>
            SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19641v3">http://arxiv.org/abs/2505.19641v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He</p>

            <p><strong>Title:</strong><br>
            SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19641v3">http://arxiv.org/abs/2505.19641v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:03:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d98297e8/60e4b395.mp3" length="21047290" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He</p>

            <p><strong>Title:</strong><br>
            SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19641v3">http://arxiv.org/abs/2505.19641v3</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning</title>
      <itunes:episode>820</itunes:episode>
      <podcast:episode>820</podcast:episode>
      <itunes:title>Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1ee2407d-b68d-49b7-9c6f-5253abf2c61d</guid>
      <link>https://share.transistor.fm/s/1877ffcd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz</p>

            <p><strong>Title:</strong><br>
            Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17813v1">http://arxiv.org/abs/2505.17813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz</p>

            <p><strong>Title:</strong><br>
            Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17813v1">http://arxiv.org/abs/2505.17813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:03:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1877ffcd/8883017b.mp3" length="18206403" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1134</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz</p>

            <p><strong>Title:</strong><br>
            Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17813v1">http://arxiv.org/abs/2505.17813v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploring the Latent Capacity of LLMs for One-Step Text Generation</title>
      <itunes:episode>819</itunes:episode>
      <podcast:episode>819</podcast:episode>
      <itunes:title>Exploring the Latent Capacity of LLMs for One-Step Text Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b3e645b0-ac0b-4253-846e-51bd0ab8f9f7</guid>
      <link>https://share.transistor.fm/s/9ac649d1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Mezentsev, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Exploring the Latent Capacity of LLMs for One-Step Text Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21189v1">http://arxiv.org/abs/2505.21189v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Mezentsev, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Exploring the Latent Capacity of LLMs for One-Step Text Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21189v1">http://arxiv.org/abs/2505.21189v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:03:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ac649d1/2c5c4ad2.mp3" length="19818457" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1235</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Gleb Mezentsev, Ivan Oseledets</p>

            <p><strong>Title:</strong><br>
            Exploring the Latent Capacity of LLMs for One-Step Text Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.21189v1">http://arxiv.org/abs/2505.21189v1</a></p>

            <p><strong>Abstract:</strong><br>
            A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence</title>
      <itunes:episode>818</itunes:episode>
      <podcast:episode>818</podcast:episode>
      <itunes:title>Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3509b2fd-e9c6-4cfd-b739-260bc41284b7</guid>
      <link>https://share.transistor.fm/s/61987aa5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu</p>

            <p><strong>Title:</strong><br>
            Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20325v1">http://arxiv.org/abs/2505.20325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu</p>

            <p><strong>Title:</strong><br>
            Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20325v1">http://arxiv.org/abs/2505.20325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:02:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/61987aa5/e53c29b2.mp3" length="21540462" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Amirhosein Ghasemabadi, Keith G. Mills, Baochun Li, Di Niu</p>

            <p><strong>Title:</strong><br>
            Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20325v1">http://arxiv.org/abs/2505.20325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization</title>
      <itunes:episode>817</itunes:episode>
      <podcast:episode>817</podcast:episode>
      <itunes:title>VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">939f2f26-a745-45de-aed3-2e96d7caa78e</guid>
      <link>https://share.transistor.fm/s/a3c41858</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19000v1">http://arxiv.org/abs/2505.19000v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19000v1">http://arxiv.org/abs/2505.19000v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 28 May 2025 21:02:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a3c41858/c17aa3f0.mp3" length="20193398" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1258</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19000v1">http://arxiv.org/abs/2505.19000v1</a></p>

            <p><strong>Abstract:</strong><br>
            Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model</title>
      <itunes:episode>816</itunes:episode>
      <podcast:episode>816</podcast:episode>
      <itunes:title>Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe7cf0a9-0f9e-42e5-8e48-9df62a8d1a18</guid>
      <link>https://share.transistor.fm/s/ece44486</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 178 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17894v1">http://arxiv.org/abs/2505.17894v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller, task-focused models can still deliver competitive translation quality. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, a result achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 178 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17894v1">http://arxiv.org/abs/2505.17894v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller, task-focused models can still deliver competitive translation quality. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, a result achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:30:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ece44486/67ae7f9f.mp3" length="19999873" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1246</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 178 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17894v1">http://arxiv.org/abs/2505.17894v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller, task-focused models can still deliver competitive translation quality. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, a result achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Shifting AI Efficiency From Model-Centric to Data-Centric Compression</title>
      <itunes:episode>815</itunes:episode>
      <podcast:episode>815</podcast:episode>
      <itunes:title>Shifting AI Efficiency From Model-Centric to Data-Centric Compression</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e3bf07e0-14aa-4f96-8a9c-49d9a32999fa</guid>
      <link>https://share.transistor.fm/s/db6d722a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Shifting AI Efficiency From Model-Centric to Data-Centric Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19147v1">http://arxiv.org/abs/2505.19147v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, <strong>we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression</strong>. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Shifting AI Efficiency From Model-Centric to Data-Centric Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19147v1">http://arxiv.org/abs/2505.19147v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, <strong>we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression</strong>. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:29:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/db6d722a/818da79f.mp3" length="21409212" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 124 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang</p>

            <p><strong>Title:</strong><br>
            Shifting AI Efficiency From Model-Centric to Data-Centric Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19147v1">http://arxiv.org/abs/2505.19147v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, <strong>we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression</strong>. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Alchemist: Turning Public Text-to-Image Data into Generative Gold</title>
      <itunes:episode>814</itunes:episode>
      <podcast:episode>814</podcast:episode>
      <itunes:title>Alchemist: Turning Public Text-to-Image Data into Generative Gold</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c28135da-80ad-4dfe-99cb-f8fedebe1a4a</guid>
      <link>https://share.transistor.fm/s/47b2d514</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin</p>

            <p><strong>Title:</strong><br>
            Alchemist: Turning Public Text-to-Image Data into Generative Gold</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19297v1">http://arxiv.org/abs/2505.19297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin</p>

            <p><strong>Title:</strong><br>
            Alchemist: Turning Public Text-to-Image Data into Generative Gold</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19297v1">http://arxiv.org/abs/2505.19297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:29:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/47b2d514/708a35ae.mp3" length="18597179" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1159</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Valerii Startsev, Alexander Ustyuzhanin, Alexey Kirillov, Dmitry Baranchuk, Sergey Kastryulin</p>

            <p><strong>Title:</strong><br>
            Alchemist: Turning Public Text-to-Image Data into Generative Gold</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19297v1">http://arxiv.org/abs/2505.19297v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training equips text-to-image (T2I) models with broad world knowledge, but this alone is often insufficient to achieve high aesthetic quality and alignment. Consequently, supervised fine-tuning (SFT) is crucial for further refinement. However, its effectiveness highly depends on the quality of the fine-tuning dataset. Existing public SFT datasets frequently target narrow domains (e.g., anime or specific art styles), and the creation of high-quality, general-purpose SFT datasets remains a significant challenge. Current curation methods are often costly and struggle to identify truly impactful samples. This challenge is further complicated by the scarcity of public general-purpose datasets, as leading models often rely on large, proprietary, and poorly documented internal data, hindering broader research progress. This paper introduces a novel methodology for creating general-purpose SFT datasets by leveraging a pre-trained generative model as an estimator of high-impact training samples. We apply this methodology to construct and release Alchemist, a compact (3,350 samples) yet highly effective SFT dataset. Experiments demonstrate that Alchemist substantially improves the generative quality of five public T2I models while preserving diversity and style. Additionally, we release the fine-tuned models' weights to the public.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs</title>
      <itunes:episode>813</itunes:episode>
      <podcast:episode>813</podcast:episode>
      <itunes:title>BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3075e388-3170-41b0-bac9-2a7c4288d2a6</guid>
      <link>https://share.transistor.fm/s/bbe620a7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu</p>

            <p><strong>Title:</strong><br>
            BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19457v1">http://arxiv.org/abs/2505.19457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu</p>

            <p><strong>Title:</strong><br>
            BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19457v1">http://arxiv.org/abs/2505.19457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:29:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bbe620a7/9e280f9c.mp3" length="22646801" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu</p>

            <p><strong>Title:</strong><br>
            BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19457v1">http://arxiv.org/abs/2505.19457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PATS: Process-Level Adaptive Thinking Mode Switching</title>
      <itunes:episode>812</itunes:episode>
      <podcast:episode>812</podcast:episode>
      <itunes:title>PATS: Process-Level Adaptive Thinking Mode Switching</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3c132a6b-a0d8-4708-8172-ea839da9fb4a</guid>
      <link>https://share.transistor.fm/s/7fe5d10e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang</p>

            <p><strong>Title:</strong><br>
            PATS: Process-Level Adaptive Thinking Mode Switching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19250v1">http://arxiv.org/abs/2505.19250v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang</p>

            <p><strong>Title:</strong><br>
            PATS: Process-Level Adaptive Thinking Mode Switching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19250v1">http://arxiv.org/abs/2505.19250v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:28:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7fe5d10e/ede93b3f.mp3" length="20406093" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Wang, Junxiao Liu, Shimao Zhang, Jiajun Chen, Shujian Huang</p>

            <p><strong>Title:</strong><br>
            PATS: Process-Level Adaptive Thinking Mode Switching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19250v1">http://arxiv.org/abs/2505.19250v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) typically adopt a fixed reasoning strategy, either simple or complex, for all questions, regardless of their difficulty. This neglect of variation in task and reasoning process complexity leads to an imbalance between performance and efficiency. Existing methods attempt to implement training-free fast-slow thinking system switching to handle problems of varying difficulty, but are limited by coarse-grained solution-level strategy adjustments. To address this issue, we propose a novel reasoning paradigm: Process-Level Adaptive Thinking Mode Switching (PATS), which enables LLMs to dynamically adjust their reasoning strategy based on the difficulty of each step, optimizing the balance between accuracy and computational efficiency. Our approach integrates Process Reward Models (PRMs) with Beam Search, incorporating progressive mode switching and bad-step penalty mechanisms. Experiments on diverse mathematical benchmarks demonstrate that our methodology achieves high accuracy while maintaining moderate token usage. This study emphasizes the significance of process-level, difficulty-aware reasoning strategy adaptation, offering valuable insights into efficient inference for LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance</title>
      <itunes:episode>811</itunes:episode>
      <podcast:episode>811</podcast:episode>
      <itunes:title>Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cf6480e8-165d-4a0e-8cfd-75d503a2c66f</guid>
      <link>https://share.transistor.fm/s/7e315e1d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16348v1">http://arxiv.org/abs/2505.16348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16348v1">http://arxiv.org/abs/2505.16348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:28:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e315e1d/2ccf94f2.mp3" length="20170406" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16348v1">http://arxiv.org/abs/2505.16348v1</a></p>

            <p><strong>Abstract:</strong><br>
            Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ARM: Adaptive Reasoning Model</title>
      <itunes:episode>810</itunes:episode>
      <podcast:episode>810</podcast:episode>
      <itunes:title>ARM: Adaptive Reasoning Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f73768fa-5d0b-429b-ad10-e579cc849924</guid>
      <link>https://share.transistor.fm/s/83e96485</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARM: Adaptive Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20258v1">http://arxiv.org/abs/2505.20258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARM: Adaptive Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20258v1">http://arxiv.org/abs/2505.20258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:27:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83e96485/94a8584d.mp3" length="21878123" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao</p>

            <p><strong>Title:</strong><br>
            ARM: Adaptive Reasoning Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.20258v1">http://arxiv.org/abs/2505.20258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</title>
      <itunes:episode>809</itunes:episode>
      <podcast:episode>809</podcast:episode>
      <itunes:title>Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">536e8b90-ed37-4b3d-adac-8d44b4808d06</guid>
      <link>https://share.transistor.fm/s/455d1540</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang</p>

            <p><strong>Title:</strong><br>
            Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19914v1">http://arxiv.org/abs/2505.19914v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang</p>

            <p><strong>Title:</strong><br>
            Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19914v1">http://arxiv.org/abs/2505.19914v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle reasoning benchmarks like Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When trained on larger models like Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), showing nice generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources of this work can be found at https://seed-enigmata.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:27:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/455d1540/bc9e2991.mp3" length="20573737" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiangjie Chen, Qianyu He, Siyu Yuan, Aili Chen, Zhicheng Cai, Weinan Dai, Hongli Yu, Qiying Yu, Xuefeng Li, Jiaze Chen, Hao Zhou, Mingxuan Wang</p>

            <p><strong>Title:</strong><br>
            Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19914v1">http://arxiv.org/abs/2505.19914v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored to improving the puzzle reasoning skills of LLMs. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on puzzle reasoning benchmarks such as Enigmata-Eval, ARC-AGI (32.8%), and ARC-AGI 2 (0.6%). It also generalizes well to out-of-domain puzzle benchmarks and mathematical reasoning, with little multi-tasking trade-off. When applied to larger models such as Seed1.5-Thinking (20B activated parameters and 200B total parameters), puzzle data from Enigmata further boosts SoTA performance on advanced math and STEM reasoning tasks such as AIME (2024-2025), BeyondAIME and GPQA (Diamond), demonstrating the generalization benefits of Enigmata. This work offers a unified, controllable framework for advancing logical reasoning in LLMs. Resources for this work are available at https://seed-enigmata.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective</title>
      <itunes:episode>808</itunes:episode>
      <podcast:episode>808</podcast:episode>
      <itunes:title>Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ead929cb-a1d0-44ae-ad34-fd496cc05536</guid>
      <link>https://share.transistor.fm/s/54daeb78</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19815v1">http://arxiv.org/abs/2505.19815v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19815v1">http://arxiv.org/abs/2505.19815v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.</p>
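
            <p><strong>Illustrative sketch:</strong><br>
            The inner-/outer-loop analogy drawn in the abstract can be illustrated with a schematic Reptile-style meta-learning loop on toy one-dimensional tasks. This is a didactic sketch of the framing (each question as a task, the trajectory as inner-loop adaptation), not the paper's training procedure.</p>

            <pre><code>import random

# Schematic Reptile-style meta-learning loop on toy 1-D quadratic "tasks":
# each question plays the role of a task, the reasoning trajectory plays the
# role of inner-loop adaptation, and training across questions updates the
# shared initialization (outer loop). A didactic sketch only.

def inner_adapt(theta, target, steps=5, lr=0.1):
    """Inner loop: a few gradient steps on one task's loss 0.5*(theta - target)**2."""
    for _ in range(steps):
        grad = theta - target
        theta = theta - lr * grad
    return theta

def meta_train(targets, meta_steps=200, meta_lr=0.05):
    """Outer loop: nudge the shared initialization toward each task's adapted parameters."""
    rng = random.Random(0)
    theta = 0.0
    for _ in range(meta_steps):
        target = rng.choice(targets)          # sample a "question"
        adapted = inner_adapt(theta, target)  # inner-loop trajectory
        theta = theta + meta_lr * (adapted - theta)
    return theta

theta0 = meta_train(targets=[1.0, 2.0, 3.0])
print(round(theta0, 2))  # ends up near the mean of the task optima (~2.0)
</code></pre>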
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:27:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/54daeb78/a8efbd1f.mp3" length="20569116" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.19815v1">http://arxiv.org/abs/2505.19815v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>B-score: Detecting biases in large language models using response history</title>
      <itunes:episode>807</itunes:episode>
      <podcast:episode>807</podcast:episode>
      <itunes:title>B-score: Detecting biases in large language models using response history</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0841149e-13c3-45a8-8cc3-2aa3737167f4</guid>
      <link>https://share.transistor.fm/s/110c2270</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            B-score: Detecting biases in large language models using response history</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18545v1">http://arxiv.org/abs/2505.18545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            B-score: Detecting biases in large language models using response history</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18545v1">http://arxiv.org/abs/2505.18545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.</p>
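
            <p><strong>Illustrative sketch:</strong><br>
            One plausible reading of a response-history bias signal is sketched below: compare how often an answer appears across independent single-turn samples with how often it appears across the turns of one conversation where the model can see its prior answers. The exact B-score definition is given in the paper; the code is a hypothetical illustration.</p>

            <pre><code>from collections import Counter

# Hypothetical sketch of a response-history bias signal in the spirit of the
# abstract: compare an answer's frequency across independent single-turn
# samples versus across turns of one multi-turn conversation where the model
# observes its prior answers. The exact B-score definition is in the paper.

def answer_frequency(answers, option):
    counts = Counter(answers)
    return counts[option] / len(answers)

def bias_signal(single_turn_answers, multi_turn_answers, option):
    """Large positive values suggest a bias toward `option` that fades once
    the model observes its own response history."""
    return (answer_frequency(single_turn_answers, option)
            - answer_frequency(multi_turn_answers, option))

# Toy example: a coin-flip question where single-turn sampling favors "heads".
single = ["heads"] * 8 + ["tails"] * 2       # 80% heads when asked independently
multi  = ["heads"] * 5 + ["tails"] * 5       # balanced within one conversation
print(bias_signal(single, multi, "heads"))   # ~0.3
</code></pre>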
            ]]>
      </content:encoded>
      <pubDate>Tue, 27 May 2025 21:26:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/110c2270/652195ee.mp3" length="22397272" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Vo, Mohammad Reza Taesiri, Daeyoung Kim, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            B-score: Detecting biases in large language models using response history</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18545v1">http://arxiv.org/abs/2505.18545v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: https://b-score.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations</title>
      <itunes:episode>806</itunes:episode>
      <podcast:episode>806</podcast:episode>
      <itunes:title>TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6f59ef7d-e5d1-40c9-8697-1ccb03012b39</guid>
      <link>https://share.transistor.fm/s/99c2378f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alan Arazi, Eilam Shapira, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18125v1">http://arxiv.org/abs/2505.18125v1</a></p>

            <p><strong>Abstract:</strong><br>
            While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alan Arazi, Eilam Shapira, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18125v1">http://arxiv.org/abs/2505.18125v1</a></p>

            <p><strong>Abstract:</strong><br>
            While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:24:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/99c2378f/169d30ab.mp3" length="19892869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1240</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alan Arazi, Eilam Shapira, Roi Reichart</p>

            <p><strong>Title:</strong><br>
            TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18125v1">http://arxiv.org/abs/2505.18125v1</a></p>

            <p><strong>Abstract:</strong><br>
            While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees (GBDTs). However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Foundation Tabular Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning</title>
      <itunes:episode>805</itunes:episode>
      <podcast:episode>805</podcast:episode>
      <itunes:title>QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ef71e737-ed66-498c-9642-7676a173e539</guid>
      <link>https://share.transistor.fm/s/0629f4c8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17667v1">http://arxiv.org/abs/2505.17667v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason over long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL and identify key challenges in suboptimal training efficiency and an unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, enhanced with a difficulty-aware retrospective sampling strategy to incentivize policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17667v1">http://arxiv.org/abs/2505.17667v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason over long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL and identify key challenges in suboptimal training efficiency and an unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, enhanced with a difficulty-aware retrospective sampling strategy to incentivize policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:24:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0629f4c8/7b7c1a65.mp3" length="23214811" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1447</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17667v1">http://arxiv.org/abs/2505.17667v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed within short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason over long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL and identify key challenges in suboptimal training efficiency and an unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, enhanced with a difficulty-aware retrospective sampling strategy to incentivize policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Quartet: Native FP4 Training Can Be Optimal for Large Language Models</title>
      <itunes:episode>804</itunes:episode>
      <podcast:episode>804</podcast:episode>
      <itunes:title>Quartet: Native FP4 Training Can Be Optimal for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4576c114-4183-4560-8523-ab64bdf6dfa0</guid>
      <link>https://share.transistor.fm/s/8f1050d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Quartet: Native FP4 Training Can Be Optimal for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14669v1">http://arxiv.org/abs/2505.14669v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g. in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a "near-optimal" low-precision training technique in terms of accuracy versus computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it can achieve state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Quartet: Native FP4 Training Can Be Optimal for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14669v1">http://arxiv.org/abs/2505.14669v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g. in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a "near-optimal" low-precision training technique in terms of accuracy versus computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it can achieve state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.</p>
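
            <p><strong>Illustrative sketch:</strong><br>
            To make the FP4 format concrete, the sketch below rounds a tensor to the nearest representable FP4 (E2M1) value under a per-tensor scale. This only illustrates the precision grid; Quartet itself involves optimized Blackwell kernels and training-specific machinery well beyond this.</p>

            <pre><code>import numpy as np

# Minimal sketch of round-to-nearest FP4 (E2M1) quantization with a per-tensor
# scale, just to make the precision format concrete; Quartet adds optimized
# Blackwell kernels and training-specific machinery beyond this.

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])  # signed E2M1 values

def quantize_fp4(x):
    """Scale the tensor so its largest magnitude maps to 6, snap each entry
    to the nearest representable FP4 value, then undo the scale."""
    scale = np.abs(x).max() / 6.0
    if scale == 0.0:
        return x.copy()
    idx = np.abs(x[..., None] / scale - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx] * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
w_q = quantize_fp4(w)
print(np.abs(w - w_q).max())   # quantization error, small relative to max|w|
</code></pre>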
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:24:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f1050d8/78405fef.mp3" length="21911599" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1366</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Quartet: Native FP4 Training Can Be Optimal for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14669v1">http://arxiv.org/abs/2505.14669v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g. in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify a "near-optimal" low-precision training technique in terms of accuracy versus computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it can achieve state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models</title>
      <itunes:episode>803</itunes:episode>
      <podcast:episode>803</podcast:episode>
      <itunes:title>Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5909d54a-0c25-494f-bfce-6819a4bf0c23</guid>
      <link>https://share.transistor.fm/s/5346d96c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Doohyuk Jang, Yoonjeon Kim, Chanjae Park, Hyun Ryu, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17225v1">http://arxiv.org/abs/2505.17225v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term <em>reasoning rigidity</em>. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce an expert-curated diagnostic set, \dataset{}. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Doohyuk Jang, Yoonjeon Kim, Chanjae Park, Hyun Ryu, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17225v1">http://arxiv.org/abs/2505.17225v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term <em>reasoning rigidity</em>. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce an expert-curated diagnostic set, \dataset{}. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:23:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5346d96c/b905d1ad.mp3" length="19841042" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1236</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Doohyuk Jang, Yoonjeon Kim, Chanjae Park, Hyun Ryu, Eunho Yang</p>

            <p><strong>Title:</strong><br>
            Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17225v1">http://arxiv.org/abs/2505.17225v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated remarkable proficiency in long and complex reasoning tasks. However, they frequently exhibit a problematic reliance on familiar reasoning patterns, a phenomenon we term <em>reasoning rigidity</em>. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories, leading to incorrect conclusions. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles, where precise adherence to specified constraints is critical. To systematically investigate reasoning rigidity, a behavior largely unexplored in prior work, we introduce an expert-curated diagnostic set, \dataset{}. Our dataset includes specially modified variants of existing mathematical benchmarks, namely AIME and MATH500, as well as well-known puzzles deliberately redesigned to require deviation from familiar reasoning strategies. Using this dataset, we identify recurring contamination patterns that occur when models default to ingrained reasoning. Specifically, we categorize this contamination into three distinctive modes: (i) Interpretation Overload, (ii) Input Distrust, and (iii) Partial Instruction Attention, each causing models to ignore or distort provided instructions. We publicly release our diagnostic set to facilitate future research on mitigating reasoning rigidity in language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>One RL to See Them All: Visual Triple Unified Reinforcement Learning</title>
      <itunes:episode>802</itunes:episode>
      <podcast:episode>802</podcast:episode>
      <itunes:title>One RL to See Them All: Visual Triple Unified Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">180224bc-2467-4d3c-a8f7-3ef81cca75eb</guid>
      <link>https://share.transistor.fm/s/4dfc4714</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan</p>

            <p><strong>Title:</strong><br>
            One RL to See Them All: Visual Triple Unified Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18129v1">http://arxiv.org/abs/2505.18129v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan</p>

            <p><strong>Title:</strong><br>
            One RL to See Them All: Visual Triple Unified Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18129v1">http://arxiv.org/abs/2505.18129v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.</p>
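
            <p><strong>Illustrative sketch:</strong><br>
            The sketch below computes box IoU and applies a reward whose acceptance threshold tightens as training progresses, to illustrate the flavor of a dynamic IoU reward for perception tasks. The schedule shown is hypothetical; the exact Dynamic IoU reward used by V-Triune is described in the paper.</p>

            <pre><code># Sketch of an IoU computation plus a reward whose acceptance threshold
# tightens as training progresses. The exact Dynamic IoU schedule used by
# V-Triune is described in the paper; the schedule below is hypothetical.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def dynamic_iou_reward(pred_box, gt_box, progress):
    """Reward the IoU only once it clears a threshold that rises from 0.5 to
    0.95 over training (progress in [0, 1]); zero reward otherwise."""
    threshold = 0.5 + 0.45 * progress
    score = iou(pred_box, gt_box)
    return score if score >= threshold else 0.0

print(dynamic_iou_reward((0, 0, 10, 10), (1, 1, 10, 10), progress=0.0))  # 0.81
print(dynamic_iou_reward((0, 0, 10, 10), (1, 1, 10, 10), progress=1.0))  # 0.0
</code></pre>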
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:23:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4dfc4714/1be464c8.mp3" length="19325267" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1204</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan</p>

            <p><strong>Title:</strong><br>
            One RL to See Them All: Visual Triple Unified Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18129v1">http://arxiv.org/abs/2505.18129v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Distilling LLM Agent into Small Models with Retrieval and Code Tools</title>
      <itunes:episode>801</itunes:episode>
      <podcast:episode>801</podcast:episode>
      <itunes:title>Distilling LLM Agent into Small Models with Retrieval and Code Tools</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d12f4722-b4db-42e0-905f-b3d9a8bd681b</guid>
      <link>https://share.transistor.fm/s/64c39c19</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Distilling LLM Agent into Small Models with Retrieval and Code Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17612v1">http://arxiv.org/abs/2505.17612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose self-consistent action generation for improving the test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Distilling LLM Agent into Small Models with Retrieval and Code Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17612v1">http://arxiv.org/abs/2505.17612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose self-consistent action generation for improving the test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.</p>
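
            <p><strong>Illustrative sketch:</strong><br>
            A generic self-consistency-style sketch of action selection is shown below: sample several candidate actions from the small agent and execute the most frequent one. It illustrates majority voting over sampled actions; the paper's self-consistent action generation may differ in its details.</p>

            <pre><code>import random
from collections import Counter

# Generic self-consistency-style sketch: sample several candidate actions from
# the small agent and execute the most frequent one. This illustrates majority
# voting over sampled actions; the paper's exact procedure may differ.

def self_consistent_action(sample_action, state, n_samples=5):
    """Draw n_samples candidate actions for `state` and return the mode."""
    candidates = [sample_action(state) for _ in range(n_samples)]
    action, _count = Counter(candidates).most_common(1)[0]
    return action

# Toy stochastic policy standing in for a sampled sLM agent step.
rng = random.Random(0)
def toy_policy(state):
    return rng.choice(["search('capital of France')",
                       "search('capital of France')",
                       "calculate(2 + 2)"])

print(self_consistent_action(toy_policy, state="Q: What is the capital of France?"))
</code></pre>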
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:23:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/64c39c19/3c3f012d.mp3" length="20695337" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Distilling LLM Agent into Small Models with Retrieval and Code Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17612v1">http://arxiv.org/abs/2505.17612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose self-consistent action generation for improving the test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization</title>
      <itunes:episode>800</itunes:episode>
      <podcast:episode>800</podcast:episode>
      <itunes:title>QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">075f3309-2f94-4d7b-95cd-0a0d00093a9f</guid>
      <link>https://share.transistor.fm/s/d0e4c23b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18092v1">http://arxiv.org/abs/2505.18092v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench respectively, establishing new SOTA performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18092v1">http://arxiv.org/abs/2505.18092v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench respectively, establishing new SOTA performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:22:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d0e4c23b/7cffbb88.mp3" length="22072933" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1376</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weizhou Shen, Chenliang Li, Fanqi Wan, Shengyi Liao, Shaopeng Lai, Bo Zhang, Yingcheng Shi, Yuning Wu, Gang Fu, Zhansheng Li, Bin Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan</p>

            <p><strong>Title:</strong><br>
            QwenLong-CPRS: Towards $\infty$-LLMs with Dynamic Context Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.18092v1">http://arxiv.org/abs/2505.18092v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents QwenLong-CPRS, a context compression framework designed for explicit long-context optimization, addressing prohibitive computation overhead during the prefill stage and the "lost in the middle" performance degradation of large language models (LLMs) during long sequence processing. Implemented through a novel dynamic context optimization mechanism, QwenLong-CPRS enables multi-granularity context compression guided by natural language instructions, achieving both efficiency gains and improved performance. Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations: (1) natural language-guided dynamic optimization, (2) bidirectional reasoning layers for enhanced boundary awareness, (3) token critic mechanisms with language modeling heads, and (4) window-parallel inference. Comprehensive evaluations across five benchmarks (4K-2M word contexts) demonstrate QwenLong-CPRS's threefold effectiveness: (1) consistent superiority over other context management methods such as RAG and sparse attention in both accuracy and efficiency; (2) architecture-agnostic integration with all flagship LLMs, including GPT-4o, Gemini2.0-pro, Claude3.7-sonnet, DeepSeek-v3, and Qwen2.5-max, achieving 21.59× context compression alongside 19.15-point average performance gains; and (3) when deployed with Qwen2.5-32B-Instruct, QwenLong-CPRS surpasses leading proprietary LLMs by 4.85 and 10.88 points on Ruler-128K and InfiniteBench respectively, establishing new SOTA performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PhyX: Does Your Model Have the "Wits" for Physical Reasoning?</title>
      <itunes:episode>799</itunes:episode>
      <podcast:episode>799</podcast:episode>
      <itunes:title>PhyX: Does Your Model Have the "Wits" for Physical Reasoning?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64eddb16-667f-4afa-80f1-e022f611b504</guid>
      <link>https://share.transistor.fm/s/fe2503f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            PhyX: Does Your Model Have the "Wits" for Physical Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15929v1">http://arxiv.org/abs/2505.15929v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave &amp; acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            PhyX: Does Your Model Have the "Wits" for Physical Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15929v1">http://arxiv.org/abs/2505.15929v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave &amp; acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:22:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fe2503f1/47d90fe4.mp3" length="21230736" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1323</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            PhyX: Does Your Model Have the "Wits" for Physical Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15929v1">http://arxiv.org/abs/2505.15929v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave &amp; acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Image and Video Generation via Test-Time Evolutionary Search</title>
      <itunes:episode>798</itunes:episode>
      <podcast:episode>798</podcast:episode>
      <itunes:title>Scaling Image and Video Generation via Test-Time Evolutionary Search</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6f0462fb-f753-4ac9-87ef-0660f60f73d0</guid>
      <link>https://share.transistor.fm/s/7e4bdc42</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan</p>

            <p><strong>Title:</strong><br>
            Scaling Image and Video Generation via Test-Time Evolutionary Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17618v1">http://arxiv.org/abs/2505.17618v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan</p>

            <p><strong>Title:</strong><br>
            Scaling Image and Video Generation via Test-Time Evolutionary Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17618v1">http://arxiv.org/abs/2505.17618v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:22:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e4bdc42/0b102a8e.mp3" length="23687089" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1477</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Haoran He, Jiajun Liang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Ling Pan</p>

            <p><strong>Title:</strong><br>
            Scaling Image and Video Generation via Test-Time Evolutionary Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17618v1">http://arxiv.org/abs/2505.17618v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the marginal cost of scaling computation (data and parameters) during model pre-training continues to increase substantially, test-time scaling (TTS) has emerged as a promising direction for improving generative model performance by allocating additional computation at inference time. While TTS has demonstrated significant success across multiple language tasks, there remains a notable gap in understanding the test-time scaling behaviors of image and video generative models (diffusion-based or flow-based models). Although recent works have initiated exploration into inference-time strategies for vision tasks, these approaches face critical limitations: being constrained to task-specific domains, exhibiting poor scalability, or falling into reward over-optimization that sacrifices sample diversity. In this paper, we propose Evolutionary Search (EvoSearch), a novel, generalist, and efficient TTS method that effectively enhances the scalability of both image and video generation across diffusion and flow models, without requiring additional training or model expansion. EvoSearch reformulates test-time scaling for diffusion and flow models as an evolutionary search problem, leveraging principles from biological evolution to efficiently explore and refine the denoising trajectory. By incorporating carefully designed selection and mutation mechanisms tailored to the stochastic differential equation denoising process, EvoSearch iteratively generates higher-quality offspring while preserving population diversity. Through extensive evaluation across both diffusion and flow architectures for image and video generation tasks, we demonstrate that our method consistently outperforms existing approaches, achieves higher diversity, and shows strong generalizability to unseen evaluation metrics. Our project is available at the website https://tinnerhrhe.github.io/evosearch.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback</title>
      <itunes:episode>797</itunes:episode>
      <podcast:episode>797</podcast:episode>
      <itunes:title>MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc001fd8-81ec-466d-9c55-f9489ef3a27e</guid>
      <link>https://share.transistor.fm/s/41740a03</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.CE</p>

            <p><strong>Authors:</strong><br>
            Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17873v1">http://arxiv.org/abs/2505.17873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.CE</p>

            <p><strong>Authors:</strong><br>
            Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17873v1">http://arxiv.org/abs/2505.17873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 26 May 2025 21:21:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/41740a03/6046fd23.mp3" length="18867208" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1176</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.CE</p>

            <p><strong>Authors:</strong><br>
            Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17873v1">http://arxiv.org/abs/2505.17873v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hypothesis ranking is a crucial component of automated scientific discovery, particularly in the natural sciences, where wet-lab experiments are costly and throughput-limited. Existing approaches focus on pre-experiment ranking, relying solely on a large language model's internal reasoning without incorporating empirical outcomes from experiments. We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones. However, developing such strategies is challenging due to the impracticality of repeatedly conducting real experiments in natural science domains. To address this, we propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground-truth hypothesis, perturbed by noise. We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator. Building on this simulator, we develop a pseudo experiment-guided ranking method that clusters hypotheses by shared functional characteristics and prioritizes candidates based on insights derived from simulated experimental feedback. Experiments show that our method outperforms pre-experiment baselines and strong ablations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification</title>
      <itunes:episode>796</itunes:episode>
      <podcast:episode>796</podcast:episode>
      <itunes:title>NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">440b23bc-c222-407e-87bb-dd188ff5f16e</guid>
      <link>https://share.transistor.fm/s/27cc839a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai</p>

            <p><strong>Title:</strong><br>
            NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16938v1">http://arxiv.org/abs/2505.16938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai</p>

            <p><strong>Title:</strong><br>
            NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16938v1">http://arxiv.org/abs/2505.16938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:54:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/27cc839a/9ec0d11e.mp3" length="20039597" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1249</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            NovelSeek Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Runmin Ma, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai</p>

            <p><strong>Title:</strong><br>
            NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16938v1">http://arxiv.org/abs/2505.16938v1</a></p>

            <p><strong>Abstract:</strong><br>
            Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models</title>
      <itunes:episode>795</itunes:episode>
      <podcast:episode>795</podcast:episode>
      <itunes:title>Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e86456a8-0e39-4b65-9b6c-c31d8b892c01</guid>
      <link>https://share.transistor.fm/s/765f3de5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14810v1">http://arxiv.org/abs/2505.14810v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14810v1">http://arxiv.org/abs/2505.14810v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:54:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/765f3de5/ba7fb58b.mp3" length="22842418" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14810v1">http://arxiv.org/abs/2505.14810v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning</title>
      <itunes:episode>794</itunes:episode>
      <podcast:episode>794</podcast:episode>
      <itunes:title>Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1572aebf-f78c-41e2-8ccd-065382cc8fe6</guid>
      <link>https://share.transistor.fm/s/cef7c2e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16410v1">http://arxiv.org/abs/2505.16410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging RL algorithms to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16410v1">http://arxiv.org/abs/2505.16410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging RL algorithms to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:54:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cef7c2e2/2f5013ef.mp3" length="21064407" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1313</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16410v1">http://arxiv.org/abs/2505.16410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging RL algorithms to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning</title>
      <itunes:episode>793</itunes:episode>
      <podcast:episode>793</podcast:episode>
      <itunes:title>Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9d400373-96e0-45c8-a259-e50b9b773726</guid>
      <link>https://share.transistor.fm/s/bb1e2487</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15966v1">http://arxiv.org/abs/2505.15966v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15966v1">http://arxiv.org/abs/2505.15966v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:53:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bb1e2487/ecad88c4.mp3" length="17199972" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1071</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alex Su, Haozhe Wang, Weimin Ren, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15966v1">http://arxiv.org/abs/2505.15966v1</a></p>

            <p><strong>Abstract:</strong><br>
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models</title>
      <itunes:episode>792</itunes:episode>
      <podcast:episode>792</podcast:episode>
      <itunes:title>KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">28f78abc-73f8-47bf-a9a4-632b69c92b0a</guid>
      <link>https://share.transistor.fm/s/4dc20494</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang</p>

            <p><strong>Title:</strong><br>
            KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16707v1">http://arxiv.org/abs/2505.16707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang</p>

            <p><strong>Title:</strong><br>
            KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16707v1">http://arxiv.org/abs/2505.16707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:53:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4dc20494/7a06678d.mp3" length="20126913" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang</p>

            <p><strong>Title:</strong><br>
            KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16707v1">http://arxiv.org/abs/2505.16707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design</title>
      <itunes:episode>791</itunes:episode>
      <podcast:episode>791</podcast:episode>
      <itunes:title>QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">226a2b42-4301-481d-965e-1de45a64dc15</guid>
      <link>https://share.transistor.fm/s/59a233d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16175v1">http://arxiv.org/abs/2505.16175v1</a></p>

            <p><strong>Abstract:</strong><br>
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16175v1">http://arxiv.org/abs/2505.16175v1</a></p>

            <p><strong>Abstract:</strong><br>
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:53:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/59a233d7/29bf7ecd.mp3" length="18939083" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1180</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Schneider, Dongfu Jiang, Chao Du, Tianyu Pang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16175v1">http://arxiv.org/abs/2505.16175v1</a></p>

            <p><strong>Abstract:</strong><br>
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning</title>
      <itunes:episode>790</itunes:episode>
      <podcast:episode>790</podcast:episode>
      <itunes:title>GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ecf07076-bdd9-45f1-bedd-d951837e3a15</guid>
      <link>https://share.transistor.fm/s/da2f42e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17022v1">http://arxiv.org/abs/2505.17022v1</a></p>

            <p><strong>Abstract:</strong><br>
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17022v1">http://arxiv.org/abs/2505.17022v1</a></p>

            <p><strong>Abstract:</strong><br>
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:52:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/da2f42e6/ab89d410.mp3" length="24423562" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1523</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.17022v1">http://arxiv.org/abs/2505.17022v1</a></p>

            <p><strong>Abstract:</strong><br>
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on the T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning</title>
      <itunes:episode>789</itunes:episode>
      <podcast:episode>789</podcast:episode>
      <itunes:title>LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a0c3f2ab-2c05-46da-a33b-35c7c8b97364</guid>
      <link>https://share.transistor.fm/s/bfe94904</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16933v1">http://arxiv.org/abs/2505.16933v1</a></p>

            <p><strong>Abstract:</strong><br>
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16933v1">http://arxiv.org/abs/2505.16933v1</a></p>

            <p><strong>Abstract:</strong><br>
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:52:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bfe94904/7a40c81b.mp3" length="19505410" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1215</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.16933v1">http://arxiv.org/abs/2505.16933v1</a></p>

            <p><strong>Abstract:</strong><br>
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Diffusion Transformers Efficiently via $μ$P</title>
      <itunes:episode>788</itunes:episode>
      <podcast:episode>788</podcast:episode>
      <itunes:title>Scaling Diffusion Transformers Efficiently via $μ$P</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76754516-9af5-49cc-8f07-da24399076b4</guid>
      <link>https://share.transistor.fm/s/ae95a7b6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Scaling Diffusion Transformers Efficiently via $μ$P</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15270v1">http://arxiv.org/abs/2505.15270v1</a></p>

            <p><strong>Abstract:</strong><br>
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the cost incurred by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Scaling Diffusion Transformers Efficiently via $μ$P</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15270v1">http://arxiv.org/abs/2505.15270v1</a></p>

            <p><strong>Abstract:</strong><br>
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the cost incurred by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 23 May 2025 20:51:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ae95a7b6/299972f6.mp3" length="22032424" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1373</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Scaling Diffusion Transformers Efficiently via $μ$P</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15270v1">http://arxiv.org/abs/2505.15270v1</a></p>

            <p><strong>Abstract:</strong><br>
Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and objective. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the cost incurred by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</title>
      <itunes:episode>787</itunes:episode>
      <podcast:episode>787</podcast:episode>
      <itunes:title>Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">304d98fa-0184-412a-a31a-b3aa0ca8b560</guid>
      <link>https://share.transistor.fm/s/a423bb3f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 80 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15277v1">http://arxiv.org/abs/2505.15277v1</a></p>

            <p><strong>Abstract:</strong><br>
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 80 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15277v1">http://arxiv.org/abs/2505.15277v1</a></p>

            <p><strong>Abstract:</strong><br>
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:58:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a423bb3f/3c0b7354.mp3" length="21965920" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 80 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo</p>

            <p><strong>Title:</strong><br>
            Web-Shepherd: Advancing PRMs for Reinforcing Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15277v1">http://arxiv.org/abs/2505.15277v1</a></p>

            <p><strong>Abstract:</strong><br>
Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM), called Web-Shepherd, which can assess web navigation trajectories at the step level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance at 10 times less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMaDA: Multimodal Large Diffusion Language Models</title>
      <itunes:episode>786</itunes:episode>
      <podcast:episode>786</podcast:episode>
      <itunes:title>MMaDA: Multimodal Large Diffusion Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2d588ccc-a677-4877-bf55-52e882520c96</guid>
      <link>https://share.transistor.fm/s/97b7d889</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            MMaDA: Multimodal Large Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15809v1">http://arxiv.org/abs/2505.15809v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            MMaDA: Multimodal Large Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15809v1">http://arxiv.org/abs/2505.15809v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:58:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/97b7d889/51ebda51.mp3" length="20214247" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            MMaDA: Multimodal Large Diffusion Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15809v1">http://arxiv.org/abs/2505.15809v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Law for Quantization-Aware Training</title>
      <itunes:episode>785</itunes:episode>
      <podcast:episode>785</podcast:episode>
      <itunes:title>Scaling Law for Quantization-Aware Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fb0a0cca-1386-4806-9f0c-10a32419bb76</guid>
      <link>https://share.transistor.fm/s/11ff5d83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Scaling Law for Quantization-Aware Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14302v1">http://arxiv.org/abs/2505.14302v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Scaling Law for Quantization-Aware Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14302v1">http://arxiv.org/abs/2505.14302v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:57:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/11ff5d83/87703ecc.mp3" length="19008011" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1184</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo</p>

            <p><strong>Title:</strong><br>
            Scaling Law for Quantization-Aware Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14302v1">http://arxiv.org/abs/2505.14302v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning</title>
      <itunes:episode>784</itunes:episode>
      <podcast:episode>784</podcast:episode>
      <itunes:title>UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9dfff46e-1a86-45fa-a0cc-719dca8de24a</guid>
      <link>https://share.transistor.fm/s/d0bc790e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14231v1">http://arxiv.org/abs/2505.14231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.</p>
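
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch of a difficulty-aware weight adjustment of the kind described above, down-weighting queries the current policy already solves in most rollouts; the function and its constants are assumptions for illustration, not UniVG-R1's exact rule.</p>

<pre><code>
def difficulty_weight(group_accuracy, min_weight=0.1):
    """Sketch of a difficulty-aware sample weight: queries the current policy
    already solves in most GRPO-style rollouts (high group accuracy) are
    down-weighted in the RL update, while hard queries keep full weight.
    The exact adjustment used by UniVG-R1 is a placeholder here."""
    difficulty = 1.0 - group_accuracy  # 0 = trivially easy, 1 = never solved
    return max(min_weight, difficulty)

# Example: a query solved in 7 of 8 rollouts gets a small weight; an unsolved one gets 1.0.
print(difficulty_weight(7 / 8))  # 0.125
print(difficulty_weight(0.0))    # 1.0
</code></pre>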
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14231v1">http://arxiv.org/abs/2505.14231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:57:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d0bc790e/77040827.mp3" length="17582808" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1095</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14231v1">http://arxiv.org/abs/2505.14231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task, and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model towards correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective</title>
      <itunes:episode>783</itunes:episode>
      <podcast:episode>783</podcast:episode>
      <itunes:title>Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">590c145e-e28e-418a-b40b-b961334646b2</guid>
      <link>https://share.transistor.fm/s/c7646980</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15045v1">http://arxiv.org/abs/2505.15045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.</p>
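
            <p><strong>Illustrative sketch:</strong><br>
            A minimal PyTorch sketch of the attention/pooling distinction behind this argument: causal LMs typically embed text with the final token's hidden state, while a bidirectional encoder (as in diffusion language models) can pool over all positions; the toy tensors below stand in for real model outputs, not the paper's models.</p>

<pre><code>
import torch

def last_token_embedding(hidden_states, attention_mask):
    """Pooling typical of causal (autoregressive) LMs: only the last real token
    has attended to the full sequence, so its hidden state is the embedding."""
    last_idx = attention_mask.sum(dim=1) - 1  # index of last non-pad token
    return hidden_states[torch.arange(hidden_states.size(0)), last_idx]

def mean_pooled_embedding(hidden_states, attention_mask):
    """Pooling suited to bidirectional encoders: every position sees the full
    context, so all non-pad token states can be averaged."""
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

# Toy stand-in for model outputs: batch of 2 sequences, 4 tokens, hidden size 8.
h = torch.randn(2, 4, 8)
m = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(last_token_embedding(h, m).shape)   # torch.Size([2, 8])
print(mean_pooled_embedding(h, m).shape)  # torch.Size([2, 8])
</code></pre>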
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15045v1">http://arxiv.org/abs/2505.15045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:57:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c7646980/9623eb5a.mp3" length="20010308" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1247</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao</p>

            <p><strong>Title:</strong><br>
            Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15045v1">http://arxiv.org/abs/2505.15045v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Agent Training for Computer Use</title>
      <itunes:episode>782</itunes:episode>
      <podcast:episode>782</podcast:episode>
      <itunes:title>Efficient Agent Training for Computer Use</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b779db94-f832-426a-8951-62e66159b1ed</guid>
      <link>https://share.transistor.fm/s/5304ec57</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanheng He, Jiahe Jin, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Efficient Agent Training for Computer Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13909v1">http://arxiv.org/abs/2505.13909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanheng He, Jiahe Jin, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Efficient Agent Training for Computer Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13909v1">http://arxiv.org/abs/2505.13909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:56:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5304ec57/c5f19a40.mp3" length="22255134" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yanheng He, Jiahe Jin, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Efficient Agent Training for Computer Use</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13909v1">http://arxiv.org/abs/2505.13909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>This Time is Different: An Observability Perspective on Time Series Foundation Models</title>
      <itunes:episode>781</itunes:episode>
      <podcast:episode>781</podcast:episode>
      <itunes:title>This Time is Different: An Observability Perspective on Time Series Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26017101-b216-480f-8e0b-2dbe581c1a23</guid>
      <link>https://share.transistor.fm/s/cdc07998</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, David Asker, Ameet Talwalkar, Othmane Abou-Amal</p>

            <p><strong>Title:</strong><br>
            This Time is Different: An Observability Perspective on Time Series Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14766v1">http://arxiv.org/abs/2505.14766v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10× larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and established general-purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, David Asker, Ameet Talwalkar, Othmane Abou-Amal</p>

            <p><strong>Title:</strong><br>
            This Time is Different: An Observability Perspective on Time Series Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14766v1">http://arxiv.org/abs/2505.14766v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10× larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and established general-purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:56:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cdc07998/738c280d.mp3" length="21265868" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, David Asker, Ameet Talwalkar, Othmane Abou-Amal</p>

            <p><strong>Title:</strong><br>
            This Time is Different: An Observability Perspective on Time Series Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14766v1">http://arxiv.org/abs/2505.14766v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10× larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and established general-purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License at https://huggingface.co/Datadog/Toto-Open-Base-1.0 and https://github.com/DataDog/toto.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learn to Reason Efficiently with Adaptive Length-based Reward Shaping</title>
      <itunes:episode>780</itunes:episode>
      <podcast:episode>780</podcast:episode>
      <itunes:title>Learn to Reason Efficiently with Adaptive Length-based Reward Shaping</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ca0bf223-6168-4eac-98d6-3f2b77be0e86</guid>
      <link>https://share.transistor.fm/s/21dd8afe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He</p>

            <p><strong>Title:</strong><br>
            Learn to Reason Efficiently with Adaptive Length-based Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15612v1">http://arxiv.org/abs/2505.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.</p>
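
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch of a length-based step reward and a difficulty-aware variant in the spirit of LASER and LASER-D; the reward values and the difficulty-to-budget mapping are illustrative assumptions, not the paper's exact specification.</p>

<pre><code>
def laser_reward(is_correct, response_len, target_len):
    """Sketch of a LASER-style step reward: correct answers within the target
    length get the full reward, correct-but-long answers get a reduced reward,
    and wrong answers get none. Constants are illustrative."""
    if not is_correct:
        return 0.0
    return 1.0 if response_len <= target_len else 0.5

def laser_d_reward(is_correct, response_len, base_target, difficulty):
    """Sketch of the difficulty-aware variant: easier queries (low difficulty
    in [0, 1]) get a tighter length budget, so lengthy CoTs are penalized more
    for easy problems. The dynamic schedule in LASER-D may differ."""
    target = int(base_target * (0.5 + 0.5 * difficulty))
    return laser_reward(is_correct, response_len, target)

print(laser_reward(True, 800, 1024))          # 1.0 (within budget)
print(laser_d_reward(True, 800, 1024, 0.1))   # 0.5 (easy query, tight budget)
</code></pre>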
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He</p>

            <p><strong>Title:</strong><br>
            Learn to Reason Efficiently with Adaptive Length-based Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15612v1">http://arxiv.org/abs/2505.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 22 May 2025 20:56:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/21dd8afe/1ed8e3f2.mp3" length="17109666" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1066</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He</p>

            <p><strong>Title:</strong><br>
            Learn to Reason Efficiently with Adaptive Length-based Reward Shaping</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.15612v1">http://arxiv.org/abs/2505.15612v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emerging Properties in Unified Multimodal Pretraining</title>
      <itunes:episode>779</itunes:episode>
      <podcast:episode>779</podcast:episode>
      <itunes:title>Emerging Properties in Unified Multimodal Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4a904b4f-131c-4088-81a0-422093ca11b0</guid>
      <link>https://share.transistor.fm/s/2d122b3b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Emerging Properties in Unified Multimodal Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14683v1">http://arxiv.org/abs/2505.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Emerging Properties in Unified Multimodal Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14683v1">http://arxiv.org/abs/2505.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:36:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2d122b3b/648c08c4.mp3" length="21913255" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1366</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 87 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Emerging Properties in Unified Multimodal Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14683v1">http://arxiv.org/abs/2505.14683v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training</title>
      <itunes:episode>778</itunes:episode>
      <podcast:episode>778</podcast:episode>
      <itunes:title>SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">16714037-c63c-4043-bf42-42f81e88138f</guid>
      <link>https://share.transistor.fm/s/6e53f6d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.AI, cs.AR, cs.CV, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11594v1">http://arxiv.org/abs/2505.11594v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.</p>
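
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch of microscaling FP4 (E2M1) quantization, the block-wise numeric format such FP4 attention builds on: each small block shares one scale and values are rounded to the nearest representable FP4 magnitude. This illustrates the data format only, not the paper's Blackwell Tensor Core kernels.</p>

<pre><code>
# Representable magnitudes of the FP4 E2M1 format used by microscaling (MX) data types.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Sketch of microscaling quantization: the block's max magnitude is mapped
    to the largest FP4 value to pick one shared scale, then every value is
    rounded to the nearest representable FP4 number. Returns the dequantized
    block and the shared scale."""
    amax = max(abs(v) for v in block) or 1.0
    scale = amax / 6.0
    quantized = [
        min(FP4_E2M1, key=lambda c: abs(abs(v) / scale - c)) * (1 if v >= 0 else -1)
        for v in block
    ]
    return [q * scale for q in quantized], scale

values = [0.02, -0.7, 1.3, 0.4, -2.2, 0.9, 0.05, 1.8]
dequantized, shared_scale = quantize_block_fp4(values)
print(dequantized, shared_scale)
</code></pre>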
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.AI, cs.AR, cs.CV, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11594v1">http://arxiv.org/abs/2505.11594v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:36:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6e53f6d7/a4af5019.mp3" length="20392760" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1271</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.LG, cs.AI, cs.AR, cs.CV, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11594v1">http://arxiv.org/abs/2505.11594v1</a></p>

            <p><strong>Abstract:</strong><br>
            The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Optimizing Anytime Reasoning via Budget Relative Policy Optimization</title>
      <itunes:episode>777</itunes:episode>
      <podcast:episode>777</podcast:episode>
      <itunes:title>Optimizing Anytime Reasoning via Budget Relative Policy Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b62d59d-d38e-4f6c-ac9a-8fddef981b0c</guid>
      <link>https://share.transistor.fm/s/e06a2b37</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Optimizing Anytime Reasoning via Budget Relative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13438v1">http://arxiv.org/abs/2505.13438v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer from each truncated thinking trace for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.</p>
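
            <p><strong>Illustrative sketch:</strong><br>
            A rough Python sketch of the anytime-reward setup and a budget-relative baseline in the spirit of BRPO; the verification stand-in and the group-mean baseline below are assumptions for illustration, not the paper's estimator.</p>

<pre><code>
def anytime_rewards(verify_answer_at, budgets):
    """Sketch of the anytime setup: the thinking trace is truncated at each
    sampled token budget, the model summarizes an answer from the truncated
    prefix, and each truncation earns a verifiable reward.
    `verify_answer_at(budget)` is a stand-in for that verification (1.0/0.0)."""
    return {b: verify_answer_at(b) for b in sorted(budgets)}

def budget_relative_advantages(rewards_per_budget, group_rewards_per_budget):
    """Rough sketch of a budget-relative baseline: each budget's reward is
    compared against the mean reward a group of rollouts achieved at the same
    budget. BRPO's actual variance-reduction estimator may differ."""
    return {
        b: r - sum(group_rewards_per_budget[b]) / len(group_rewards_per_budget[b])
        for b, r in rewards_per_budget.items()
    }

budgets = [256, 512, 1024]
rewards = anytime_rewards(lambda b: 1.0 if b >= 512 else 0.0, budgets)
group = {256: [0.0, 0.0, 1.0], 512: [0.0, 1.0, 1.0], 1024: [1.0, 1.0, 1.0]}
print(budget_relative_advantages(rewards, group))
</code></pre>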
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Optimizing Anytime Reasoning via Budget Relative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13438v1">http://arxiv.org/abs/2505.13438v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer from each truncated thinking trace for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:36:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e06a2b37/c6041030.mp3" length="21157600" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1319</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Optimizing Anytime Reasoning via Budget Relative Policy Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13438v1">http://arxiv.org/abs/2505.13438v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer from each truncated thinking trace for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank</title>
      <itunes:episode>776</itunes:episode>
      <podcast:episode>776</podcast:episode>
      <itunes:title>VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2d8cd27-1b86-42f0-a2f2-74f4f4b20438</guid>
      <link>https://share.transistor.fm/s/a817e50d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma</p>

            <p><strong>Title:</strong><br>
            VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14460v1">http://arxiv.org/abs/2505.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma</p>

            <p><strong>Title:</strong><br>
            VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14460v1">http://arxiv.org/abs/2505.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:35:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a817e50d/90eaf6a8.mp3" length="19878254" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1239</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma</p>

            <p><strong>Title:</strong><br>
            VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14460v1">http://arxiv.org/abs/2505.14460v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Agentic Reinforcement Fine-Tuning</title>
      <itunes:episode>775</itunes:episode>
      <podcast:episode>775</podcast:episode>
      <itunes:title>Visual Agentic Reinforcement Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">181a2298-62ca-476b-ab97-22a4595c6b24</guid>
      <link>https://share.transistor.fm/s/5b2b4481</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual Agentic Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14246v1">http://arxiv.org/abs/2505.14246v1</a></p>

            <p><strong>Abstract:</strong><br>
            A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as browsing the web for search and writing/executing code for image manipulation, in order to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, together with their corresponding benchmarks, remains much less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves gains of +29.3% F1 / +25.9% EM on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual Agentic Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14246v1">http://arxiv.org/abs/2505.14246v1</a></p>

            <p><strong>Abstract:</strong><br>
            A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as browsing the web for search and writing/executing code for image manipulation, in order to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, together with their corresponding benchmarks, remains much less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves gains of +29.3% F1 / +25.9% EM on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:35:22 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5b2b4481/933395a9.mp3" length="22630878" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1411</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual Agentic Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.14246v1">http://arxiv.org/abs/2505.14246v1</a></p>

            <p><strong>Abstract:</strong><br>
            A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools, such as browsing the web for search and writing/executing code for image manipulation, in order to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, together with their corresponding benchmarks, remains much less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. We also present a Multi-modal Agentic Tool Bench (MAT) with two settings (MAT-Search and MAT-Coding) designed to evaluate LVLMs' agentic search and coding abilities. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, ultimately surpassing GPT-4o. Visual-ARFT also achieves gains of +29.3% F1 / +25.9% EM on existing multi-hop QA benchmarks such as 2Wiki and HotpotQA, demonstrating strong generalization capabilities. Our findings suggest that Visual-ARFT offers a promising path toward building robust and generalizable multimodal agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Neurosymbolic Diffusion Models</title>
      <itunes:episode>774</itunes:episode>
      <podcast:episode>774</podcast:episode>
      <itunes:title>Neurosymbolic Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">15fb1fd0-d801-41cf-91ae-8c39656cbb8a</guid>
      <link>https://share.transistor.fm/s/e9c5e206</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari</p>

            <p><strong>Title:</strong><br>
            Neurosymbolic Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13138v1">http://arxiv.org/abs/2505.13138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and quantifying uncertainty. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari</p>

            <p><strong>Title:</strong><br>
            Neurosymbolic Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13138v1">http://arxiv.org/abs/2505.13138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and quantifying uncertainty. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 21 May 2025 20:35:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9c5e206/f8c7c5bc.mp3" length="22844027" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari</p>

            <p><strong>Title:</strong><br>
            Neurosymbolic Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13138v1">http://arxiv.org/abs/2505.13138v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and quantifying uncertainty. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chain-of-Model Learning for Language Model</title>
      <itunes:episode>773</itunes:episode>
      <podcast:episode>773</podcast:episode>
      <itunes:title>Chain-of-Model Learning for Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">71ace55c-6dbe-4d37-831d-98b3ecc9cf4e</guid>
      <link>https://share.transistor.fm/s/e292a0de</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            Chain-of-Model Learning for Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11820v1">http://arxiv.org/abs/2505.11820v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain style, thereby introducing greater scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up the model size by increasing the number of chains based on the previous models (i.e., chains), and offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air by adding a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration, and so on. Experimental results demonstrate that our CoLM family can achieve performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            Chain-of-Model Learning for Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11820v1">http://arxiv.org/abs/2505.11820v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain style, thereby introducing greater scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up the model size by increasing the number of chains based on the previous models (i.e., chains), and offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air by adding a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration, and so on. Experimental results demonstrate that our CoLM family can achieve performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:18:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e292a0de/698b6153.mp3" length="22736624" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 70 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            Chain-of-Model Learning for Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11820v1">http://arxiv.org/abs/2505.11820v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates a causal relationship into the hidden states of each layer in a chain style, thereby introducing greater scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up the model size by increasing the number of chains based on the previous models (i.e., chains), and offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air by adding a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration, and so on. Experimental results demonstrate that our CoLM family can achieve performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and offering multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AdaptThink: Reasoning Models Can Learn When to Think</title>
      <itunes:episode>772</itunes:episode>
      <podcast:episode>772</podcast:episode>
      <itunes:title>AdaptThink: Reasoning Models Can Learn When to Think</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">504382e7-18f5-4f0d-93ca-a7f947144c11</guid>
      <link>https://share.transistor.fm/s/d234cf36</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            AdaptThink: Reasoning Models Can Learn When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13417v1">http://arxiv.org/abs/2505.13417v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            AdaptThink: Reasoning Models Can Learn When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13417v1">http://arxiv.org/abs/2505.13417v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:18:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d234cf36/9e78c36a.mp3" length="19751987" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1231</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 58 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            AdaptThink: Reasoning Models Can Learn When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13417v1">http://arxiv.org/abs/2505.13417v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning</title>
      <itunes:episode>771</itunes:episode>
      <podcast:episode>771</podcast:episode>
      <itunes:title>AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c67fd17e-fc23-455f-b7c7-8be67afbaab7</guid>
      <link>https://share.transistor.fm/s/a5a15d6d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, Shuangzhi Wu</p>

            <p><strong>Title:</strong><br>
            AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11896v1">http://arxiv.org/abs/2505.11896v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, Shuangzhi Wu</p>

            <p><strong>Title:</strong><br>
            AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11896v1">http://arxiv.org/abs/2505.11896v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:18:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5a15d6d/19960c46.mp3" length="20168726" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, Shuangzhi Wu</p>

            <p><strong>Title:</strong><br>
            AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11896v1">http://arxiv.org/abs/2505.11896v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction</title>
      <itunes:episode>770</itunes:episode>
      <podcast:episode>770</podcast:episode>
      <itunes:title>Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6fd07f07-5569-4f05-9775-38dd34af09d4</guid>
      <link>https://share.transistor.fm/s/d391d0b0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Willette, Heejun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11254v1">http://arxiv.org/abs/2505.11254v1</a></p>

            <p><strong>Abstract:</strong><br>
            The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36 percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while only adding a small overhead. Our method can maintain approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Willette, Heejun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11254v1">http://arxiv.org/abs/2505.11254v1</a></p>

            <p><strong>Abstract:</strong><br>
            The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36 percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while only adding a small overhead. Our method can maintain approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:17:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d391d0b0/7f366c83.mp3" length="19742821" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1230</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Willette, Heejun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11254v1">http://arxiv.org/abs/2505.11254v1</a></p>

            <p><strong>Abstract:</strong><br>
            The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36 percentage-point performance increase, recovering 88% of quadratic attention accuracy on the 131K RULER benchmark when applied on top of sliding window attention with sink tokens, while only adding a small overhead. Our method can maintain approximately 98.5% sparsity relative to full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis</title>
      <itunes:episode>769</itunes:episode>
      <podcast:episode>769</podcast:episode>
      <itunes:title>Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">948ac11c-add8-4587-9518-bec89dc5b1d7</guid>
      <link>https://share.transistor.fm/s/6b5d3e78</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13227v1">http://arxiv.org/abs/2505.13227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples generated through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13227v1">http://arxiv.org/abs/2505.13227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples generated through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:17:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b5d3e78/c4ca9c2a.mp3" length="21259173" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13227v1">http://arxiv.org/abs/2505.13227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, which contains 4 million examples generated through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Faster Video Diffusion with Trainable Sparse Attention</title>
      <itunes:episode>768</itunes:episode>
      <podcast:episode>768</podcast:episode>
      <itunes:title>Faster Video Diffusion with Trainable Sparse Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8b8b4265-db27-469d-8f66-30eb828f2882</guid>
      <link>https://share.transistor.fm/s/c058739f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Faster Video Diffusion with Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13389v1">http://arxiv.org/abs/2505.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at <em>both</em> training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight <em>critical tokens</em>; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.</p>
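
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal PyTorch-style sketch of the coarse tile-selection / fine intra-tile attention idea described above. The tile size, number of kept tiles, and all names are assumptions for illustration; the actual VSA kernel is a fused, hardware-efficient implementation.</p>

            <pre><code>import torch
import torch.nn.functional as F

def vsa_style_sparse_attention(q, k, v, tile=16, keep=4):
    # q, k, v: [seq, dim]; seq is assumed divisible by tile for brevity.
    seq, dim = q.shape
    n_tiles = seq // tile
    # Coarse stage: mean-pool queries and keys per tile, score tile pairs once.
    q_tiles = q.view(n_tiles, tile, dim).mean(dim=1)
    k_tiles = k.view(n_tiles, tile, dim).mean(dim=1)
    tile_scores = q_tiles @ k_tiles.t() / dim ** 0.5
    keep_idx = tile_scores.topk(keep, dim=-1).indices   # "critical" tiles per query tile
    out = torch.zeros_like(q)
    # Fine stage: token-level attention only inside the selected tiles.
    for qt in range(n_tiles):
        cols = torch.cat([torch.arange(int(i) * tile, int(i) * tile + tile) for i in keep_idx[qt]])
        qs = q[qt * tile:qt * tile + tile]
        attn = F.softmax(qs @ k[cols].t() / dim ** 0.5, dim=-1)
        out[qt * tile:qt * tile + tile] = attn @ v[cols]
    return out

out = vsa_style_sparse_attention(torch.randn(64, 32), torch.randn(64, 32), torch.randn(64, 32))
</code></pre>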
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Faster Video Diffusion with Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13389v1">http://arxiv.org/abs/2505.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at <em>both</em> training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight <em>critical tokens</em>; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:17:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c058739f/01dbd101.mp3" length="24699372" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1540</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Haofeng Huang, Yongqi Chen, Will Lin, Zhengzhong Liu, Ion Stoica, Eric P. Xing, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Faster Video Diffusion with Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13389v1">http://arxiv.org/abs/2505.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at <em>both</em> training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight <em>critical tokens</em>; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinkless: LLM Learns When to Think</title>
      <itunes:episode>767</itunes:episode>
      <podcast:episode>767</podcast:episode>
      <itunes:title>Thinkless: LLM Learns When to Think</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b08ecf1-8416-4ee6-8e0b-2a8db222e9c4</guid>
      <link>https://share.transistor.fm/s/5a99e1ca</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Thinkless: LLM Learns When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13379v1">http://arxiv.org/abs/2505.13379v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one marking concise responses and one marking detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless</p>
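
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy PyTorch-style sketch of the decoupling idea described above: the mode-selection (control token) term and the answer (response) term of a policy-gradient loss are weighted independently. All names and weights are assumptions for illustration, not the authors' implementation.</p>

            <pre><code>import torch

def decoupled_policy_loss(logp_mode, logp_answer, advantage, w_mode=1.0, w_answer=1.0):
    # advantage: one group-normalized scalar per sampled rollout (GRPO-style).
    # logp_mode: log-prob of the single control token that chose short vs. long reasoning.
    # logp_answer: mean log-prob of the generated answer tokens.
    mode_term = -(advantage * logp_mode).mean()
    answer_term = -(advantage * logp_answer).mean()
    # Weighting the two terms separately is the "decoupled" part: it keeps the
    # mode-selection signal from being drowned out by, or dominating, the response loss.
    return w_mode * mode_term + w_answer * answer_term

logp_mode = torch.randn(4, requires_grad=True)
logp_answer = torch.randn(4, requires_grad=True)
advantage = torch.randn(4)
decoupled_policy_loss(logp_mode, logp_answer, advantage).backward()
</code></pre>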
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Thinkless: LLM Learns When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13379v1">http://arxiv.org/abs/2505.13379v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one marking concise responses and one marking detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:16:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5a99e1ca/fe3ad99a.mp3" length="17326971" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1079</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gongfan Fang, Xinyin Ma, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            Thinkless: LLM Learns When to Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13379v1">http://arxiv.org/abs/2505.13379v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, one marking concise responses and one marking detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Model Merging in Pre-training of Large Language Models</title>
      <itunes:episode>766</itunes:episode>
      <podcast:episode>766</podcast:episode>
      <itunes:title>Model Merging in Pre-training of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12d07fb8-226f-433c-9e18-317b8f386ad8</guid>
      <link>https://share.transistor.fm/s/f548fade</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Siyuan Qiao, Liang Xiang, Yonghui Wu</p>

            <p><strong>Title:</strong><br>
            Model Merging in Pre-training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.12082v2">http://arxiv.org/abs/2505.12082v2</a></p>

            <p><strong>Abstract:</strong><br>
            Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.</p>
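
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal sketch of merging checkpoints by weighted parameter averaging, the basic operation behind merging checkpoints saved along a constant-learning-rate trajectory. The function name and uniform weights are assumptions for illustration.</p>

            <pre><code>import torch

def merge_checkpoints(state_dicts, weights=None):
    # Weighted average of parameter tensors across checkpoints from one training run.
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy usage: three "checkpoints" of a tiny linear model.
ckpts = [torch.nn.Linear(8, 8).state_dict() for _ in range(3)]
averaged = merge_checkpoints(ckpts)
</code></pre>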
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Siyuan Qiao, Liang Xiang, Yonghui Wu</p>

            <p><strong>Title:</strong><br>
            Model Merging in Pre-training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.12082v2">http://arxiv.org/abs/2505.12082v2</a></p>

            <p><strong>Abstract:</strong><br>
            Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:16:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f548fade/95b1f898.mp3" length="22166539" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Siyuan Qiao, Liang Xiang, Yonghui Wu</p>

            <p><strong>Title:</strong><br>
            Model Merging in Pre-training of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.12082v2">http://arxiv.org/abs/2505.12082v2</a></p>

            <p><strong>Abstract:</strong><br>
            Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space</title>
      <itunes:episode>765</itunes:episode>
      <podcast:episode>765</podcast:episode>
      <itunes:title>Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">714fb37c-53ed-49fb-99ee-233b53f506d8</guid>
      <link>https://share.transistor.fm/s/00b4b516</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13308v1">http://arxiv.org/abs/2505.13308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.</p>
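
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy sketch of the test-time, instance-level idea described above: a latent vector is updated by a REINFORCE-style policy gradient using a self-generated reward, while the model weights stay frozen. The decode and reward functions below are placeholders (assumptions), not the authors' implementation.</p>

            <pre><code>import torch

def decode_logprob(z):
    # Placeholder: a real system would decode an answer from the LLM conditioned on z
    # and return the answer together with a log-probability differentiable w.r.t. z.
    logprob = -(z - 3.0).pow(2).sum()
    return "answer", logprob

def self_reward(answer):
    # Placeholder self-generated reward signal (e.g. model-judged correctness).
    return 1.0

def latent_test_time_search(z_init, steps=8, lr=0.1):
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        answer, logprob = decode_logprob(z)
        loss = -(self_reward(answer) * logprob)   # REINFORCE-style instance-level update
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

z_star = latent_test_time_search(torch.zeros(16))
</code></pre>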
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13308v1">http://arxiv.org/abs/2505.13308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 20 May 2025 21:16:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/00b4b516/9dd240c0.mp3" length="23932869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1492</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng</p>

            <p><strong>Title:</strong><br>
            Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.13308v1">http://arxiv.org/abs/2505.13308v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen3 Technical Report</title>
      <itunes:episode>764</itunes:episode>
      <podcast:episode>764</podcast:episode>
      <itunes:title>Qwen3 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c3816988-5bc3-4e74-b668-85e11acf6e5c</guid>
      <link>https://share.transistor.fm/s/afe47940</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09388v1">http://arxiv.org/abs/2505.09388v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Experts (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09388v1">http://arxiv.org/abs/2505.09388v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Experts (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 19 May 2025 20:18:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/afe47940/7d6fb735.mp3" length="20714935" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1291</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 117 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen3 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09388v1">http://arxiv.org/abs/2505.09388v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Experts (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning</title>
      <itunes:episode>763</itunes:episode>
      <podcast:episode>763</podcast:episode>
      <itunes:title>GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9e76e4c1-4bc3-4ef0-b01c-7e850a6fc2a9</guid>
      <link>https://share.transistor.fm/s/c0b68605</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11049v1">http://arxiv.org/abs/2505.11049v1</a></p>

            <p><strong>Abstract:</strong><br>
            To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/</p>
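
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy sketch of a length-aware reward of the kind described above, combining accuracy, format, and token cost. The exact weights and functional form are assumptions for illustration.</p>

            <pre><code>def length_aware_safety_reward(correct, well_formatted, n_tokens,
                               w_acc=1.0, w_fmt=0.2, w_len=0.001):
    # Accuracy and format contribute positively; token count is penalized so the
    # guard model learns to reason without padding its reasoning indefinitely.
    reward = w_acc * float(correct) + w_fmt * float(well_formatted)
    return reward - w_len * n_tokens

print(length_aware_safety_reward(True, True, 350))   # 1.2 - 0.35 = 0.85
</code></pre>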
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11049v1">http://arxiv.org/abs/2505.11049v1</a></p>

            <p><strong>Abstract:</strong><br>
            To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 19 May 2025 20:18:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c0b68605/a7022fc8.mp3" length="22942278" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11049v1">http://arxiv.org/abs/2505.11049v1</a></p>

            <p><strong>Abstract:</strong><br>
            To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly</title>
      <itunes:episode>762</itunes:episode>
      <podcast:episode>762</podcast:episode>
      <itunes:title>MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eb45afce-f5cd-413e-a3a8-7d28f491bc81</guid>
      <link>https://share.transistor.fm/s/7dd5dd58</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman</p>

            <p><strong>Title:</strong><br>
            MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10610v1">http://arxiv.org/abs/2505.10610v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman</p>

            <p><strong>Title:</strong><br>
            MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10610v1">http://arxiv.org/abs/2505.10610v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 19 May 2025 20:17:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7dd5dd58/177e1409.mp3" length="18713813" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1166</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman</p>

            <p><strong>Title:</strong><br>
            MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10610v1">http://arxiv.org/abs/2505.10610v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual Planning: Let's Think Only with Images</title>
      <itunes:episode>761</itunes:episode>
      <podcast:episode>761</podcast:episode>
      <itunes:title>Visual Planning: Let's Think Only with Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37fc93cb-dd6f-42d2-99be-c622b564a7b1</guid>
      <link>https://share.transistor.fm/s/7d07cff8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić</p>

            <p><strong>Title:</strong><br>
            Visual Planning: Let's Think Only with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11409v1">http://arxiv.org/abs/2505.11409v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić</p>

            <p><strong>Title:</strong><br>
            Visual Planning: Let's Think Only with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11409v1">http://arxiv.org/abs/2505.11409v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 19 May 2025 20:17:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d07cff8/c7bb6f4c.mp3" length="21115363" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić</p>

            <p><strong>Title:</strong><br>
            Visual Planning: Let's Think Only with Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.11409v1">http://arxiv.org/abs/2505.11409v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning across a set of representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models</title>
      <itunes:episode>760</itunes:episode>
      <podcast:episode>760</podcast:episode>
      <itunes:title>Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a7b1747b-3a43-43ab-947d-c3f02e7623db</guid>
      <link>https://share.transistor.fm/s/e1377302</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li</p>

            <p><strong>Title:</strong><br>
            Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10554v1">http://arxiv.org/abs/2505.10554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li</p>

            <p><strong>Title:</strong><br>
            Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10554v1">http://arxiv.org/abs/2505.10554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 May 2025 19:56:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e1377302/11cb933e.mp3" length="21054797" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 76 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li</p>

            <p><strong>Title:</strong><br>
            Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.10554v1">http://arxiv.org/abs/2505.10554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>System Prompt Optimization with Meta-Learning</title>
      <itunes:episode>759</itunes:episode>
      <podcast:episode>759</podcast:episode>
      <itunes:title>System Prompt Optimization with Meta-Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">68c85f16-3eae-49fa-92c2-c4401870b947</guid>
      <link>https://share.transistor.fm/s/5bdd11ef</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            System Prompt Optimization with Meta-Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09666v1">http://arxiv.org/abs/2505.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, largely overlooking the system prompt, which, once optimized, is applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we propose a meta-learning framework that meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, showing that our approach produces system prompts that generalize effectively to diverse user prompts. Our findings also reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            System Prompt Optimization with Meta-Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09666v1">http://arxiv.org/abs/2505.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, largely overlooking the system prompt, which, once optimized, is applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we propose a meta-learning framework that meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, showing that our approach produces system prompts that generalize effectively to diverse user prompts. Our findings also reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 16 May 2025 19:56:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5bdd11ef/c2f59150.mp3" length="20865005" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1300</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yumin Choi, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            System Prompt Optimization with Meta-Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09666v1">http://arxiv.org/abs/2505.09666v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, largely overlooking the system prompt, which, once optimized, is applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we propose a meta-learning framework that meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, showing that our approach produces system prompts that generalize effectively to diverse user prompts. Our findings also reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset</title>
      <itunes:episode>758</itunes:episode>
      <podcast:episode>758</podcast:episode>
      <itunes:title>BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d686c782-3671-41a3-9123-3b7fb4eda24f</guid>
      <link>https://share.transistor.fm/s/bb73cacb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09568v1">http://arxiv.org/abs/2505.09568v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09568v1">http://arxiv.org/abs/2505.09568v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 May 2025 20:06:41 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bb73cacb/fcede928.mp3" length="18730535" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1167</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09568v1">http://arxiv.org/abs/2505.09568v1</a></p>

            <p><strong>Abstract:</strong><br>
            Unifying image understanding and generation has gained growing attention in recent research on multimodal models. Although design choices for image understanding have been extensively studied, the optimal model architecture and training recipe for a unified framework with image generation remain underexplored. Motivated by the strong potential of autoregressive and diffusion models for high-quality generation and scalability, we conduct a comprehensive study of their use in unified multimodal settings, with emphasis on image representations, modeling objectives, and training strategies. Grounded in these investigations, we introduce a novel approach that employs a diffusion transformer to generate semantically rich CLIP image features, in contrast to conventional VAE-based representations. This design yields both higher training efficiency and improved generative quality. Furthermore, we demonstrate that a sequential pretraining strategy for unified models (first training on image understanding and subsequently on image generation) offers practical advantages by preserving image understanding capability while developing strong image generation ability. Finally, we carefully curate a high-quality instruction-tuning dataset, BLIP3o-60k, for image generation by prompting GPT-4o with a diverse set of captions covering various scenes, objects, human gestures, and more. Building on our innovative model design, training recipe, and datasets, we develop BLIP3-o, a suite of state-of-the-art unified multimodal models. BLIP3-o achieves superior performance across most of the popular benchmarks spanning both image understanding and generation tasks. To facilitate future research, we fully open-source our models, including code, model weights, training scripts, and pretraining and instruction tuning datasets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception</title>
      <itunes:episode>757</itunes:episode>
      <podcast:episode>757</podcast:episode>
      <itunes:title>DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">055442b2-f3a3-4b69-be88-23020ba6115c</guid>
      <link>https://share.transistor.fm/s/0f7f5fcb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian</p>

            <p><strong>Title:</strong><br>
            DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04410v1">http://arxiv.org/abs/2505.04410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian</p>

            <p><strong>Title:</strong><br>
            DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04410v1">http://arxiv.org/abs/2505.04410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 May 2025 20:06:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0f7f5fcb/496bce99.mp3" length="18312129" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1141</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junjie Wang, Bin Chen, Yulin Li, Bin Kang, Yichi Chen, Zhuotao Tian</p>

            <p><strong>Title:</strong><br>
            DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04410v1">http://arxiv.org/abs/2505.04410v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures</title>
      <itunes:episode>756</itunes:episode>
      <podcast:episode>756</podcast:episode>
      <itunes:title>Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9331c67b-c098-4460-b253-c330df86b3dc</guid>
      <link>https://share.transistor.fm/s/1b849e39</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.DC, cs.AI, cs.AR</p>

            <p><strong>Authors:</strong><br>
            Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei</p>

            <p><strong>Title:</strong><br>
            Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09343v1">http://arxiv.org/abs/2505.09343v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.DC, cs.AI, cs.AR</p>

            <p><strong>Authors:</strong><br>
            Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei</p>

            <p><strong>Title:</strong><br>
            Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09343v1">http://arxiv.org/abs/2505.09343v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 15 May 2025 20:05:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1b849e39/e0470650.mp3" length="20987934" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.DC, cs.AI, cs.AR</p>

            <p><strong>Authors:</strong><br>
            Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, Wenfeng Liang, Ying He, Yuqing Wang, Yuxuan Liu, Y. X. Wei</p>

            <p><strong>Title:</strong><br>
            Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.09343v1">http://arxiv.org/abs/2505.09343v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</title>
      <itunes:episode>755</itunes:episode>
      <podcast:episode>755</podcast:episode>
      <itunes:title>MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">73c1fb88-a6ea-4957-a872-24dc51645947</guid>
      <link>https://share.transistor.fm/s/a7a19501</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | eess.AS, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He</p>

            <p><strong>Title:</strong><br>
            MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07916v1">http://arxiv.org/abs/2505.07916v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | eess.AS, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He</p>

            <p><strong>Title:</strong><br>
            MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07916v1">http://arxiv.org/abs/2505.07916v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 14 May 2025 19:46:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7a19501/7d275f77.mp3" length="21129194" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | eess.AS, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He</p>

            <p><strong>Title:</strong><br>
            MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07916v1">http://arxiv.org/abs/2505.07916v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seed1.5-VL Technical Report</title>
      <itunes:episode>754</itunes:episode>
      <podcast:episode>754</podcast:episode>
      <itunes:title>Seed1.5-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">648bc6c8-ddee-4669-a2f7-e197ac0ff980</guid>
      <link>https://share.transistor.fm/s/9dbb7b15</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen, Ke Wang, Keyu Pan, Kun Zhang, Kunchang Li, Lanxin Li, Lei Li, Lei Shi, Li Han, Liang Xiang, Liangqiang Chen, Lin Chen, Lin Li, Lin Yan, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Wu, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Ru Zhang, Rui Wang, Rui Yang, Rui Zhao, Shaoqiang Xu, Shihao Liang, Shipeng Yan, Shu Zhong, Shuaishuai Cao, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Songhua Cai, Tenglong Ao, Tianhao Yang, Tingting Zhang, Wanjun Zhong, Wei Jia, Wei Weng, Weihao Yu, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xiao Li, Xiaolei Zhu, Xiaoying Jia, Xijin Zhang, Xin Liu, Xinchen Zhang, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuefeng Xiao, Xujing Li, Yan Wu, Yawei Wen, Yifan Du, Yihao Zhang, Yining Ye, Yonghui Wu, Yu Liu, Yu Yue, Yufeng Zhou, Yufeng Yuan, Yuhang Xu, Yuhong Yang, Yun Zhang, Yunhao Fang, Yuntao Li, Yurui Ren, Yuwen Xiong, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song</p>

            <p><strong>Title:</strong><br>
            Seed1.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07062v1">http://arxiv.org/abs/2505.07062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen, Ke Wang, Keyu Pan, Kun Zhang, Kunchang Li, Lanxin Li, Lei Li, Lei Shi, Li Han, Liang Xiang, Liangqiang Chen, Lin Chen, Lin Li, Lin Yan, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Wu, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Ru Zhang, Rui Wang, Rui Yang, Rui Zhao, Shaoqiang Xu, Shihao Liang, Shipeng Yan, Shu Zhong, Shuaishuai Cao, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Songhua Cai, Tenglong Ao, Tianhao Yang, Tingting Zhang, Wanjun Zhong, Wei Jia, Wei Weng, Weihao Yu, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xiao Li, Xiaolei Zhu, Xiaoying Jia, Xijin Zhang, Xin Liu, Xinchen Zhang, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuefeng Xiao, Xujing Li, Yan Wu, Yawei Wen, Yifan Du, Yihao Zhang, Yining Ye, Yonghui Wu, Yu Liu, Yu Yue, Yufeng Zhou, Yufeng Yuan, Yuhang Xu, Yuhong Yang, Yun Zhang, Yunhao Fang, Yuntao Li, Yurui Ren, Yuwen Xiong, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song</p>

            <p><strong>Title:</strong><br>
            Seed1.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07062v1">http://arxiv.org/abs/2505.07062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428)</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:34:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9dbb7b15/668841cd.mp3" length="20016948" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1247</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 86 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue Ling, Yujia Qin, Zanbo Wang, Zhiwu He, Aoxue Zhang, Bairen Yi, Bencheng Liao, Can Huang, Can Zhang, Chaorui Deng, Chaoyi Deng, Cheng Lin, Cheng Yuan, Chenggang Li, Chenhui Gou, Chenwei Lou, Chengzhi Wei, Chundian Liu, Chunyuan Li, Deyao Zhu, Donghong Zhong, Feng Li, Feng Zhang, Gang Wu, Guodong Li, Guohong Xiao, Haibin Lin, Haihua Yang, Haoming Wang, Heng Ji, Hongxiang Hao, Hui Shen, Huixia Li, Jiahao Li, Jialong Wu, Jianhua Zhu, Jianpeng Jiao, Jiashi Feng, Jiaze Chen, Jianhui Duan, Jihao Liu, Jin Zeng, Jingqun Tang, Jingyu Sun, Joya Chen, Jun Long, Junda Feng, Junfeng Zhan, Junjie Fang, Junting Lu, Kai Hua, Kai Liu, Kai Shen, Kaiyuan Zhang, Ke Shen, Ke Wang, Keyu Pan, Kun Zhang, Kunchang Li, Lanxin Li, Lei Li, Lei Shi, Li Han, Liang Xiang, Liangqiang Chen, Lin Chen, Lin Li, Lin Yan, Liying Chi, Longxiang Liu, Mengfei Du, Mingxuan Wang, Ningxin Pan, Peibin Chen, Pengfei Chen, Pengfei Wu, Qingqing Yuan, Qingyao Shuai, Qiuyan Tao, Renjie Zheng, Renrui Zhang, Ru Zhang, Rui Wang, Rui Yang, Rui Zhao, Shaoqiang Xu, Shihao Liang, Shipeng Yan, Shu Zhong, Shuaishuai Cao, Shuangzhi Wu, Shufan Liu, Shuhan Chang, Songhua Cai, Tenglong Ao, Tianhao Yang, Tingting Zhang, Wanjun Zhong, Wei Jia, Wei Weng, Weihao Yu, Wenhao Huang, Wenjia Zhu, Wenli Yang, Wenzhi Wang, Xiang Long, XiangRui Yin, Xiao Li, Xiaolei Zhu, Xiaoying Jia, Xijin Zhang, Xin Liu, Xinchen Zhang, Xinyu Yang, Xiongcai Luo, Xiuli Chen, Xuantong Zhong, Xuefeng Xiao, Xujing Li, Yan Wu, Yawei Wen, Yifan Du, Yihao Zhang, Yining Ye, Yonghui Wu, Yu Liu, Yu Yue, Yufeng Zhou, Yufeng Yuan, Yuhang Xu, Yuhong Yang, Yun Zhang, Yunhao Fang, Yuntao Li, Yurui Ren, Yuwen Xiong, Zehua Hong, Zehua Wang, Zewei Sun, Zeyu Wang, Zhao Cai, Zhaoyue Zha, Zhecheng An, Zhehui Zhao, Zhengzhuo Xu, Zhipeng Chen, Zhiyong Wu, Zhuofan Zheng, Zihao Wang, Zilong Huang, Ziyu Zhu, Zuquan Song</p>

            <p><strong>Title:</strong><br>
            Seed1.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07062v1">http://arxiv.org/abs/2505.07062v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, and we hope this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining</title>
      <itunes:episode>753</itunes:episode>
      <podcast:episode>753</podcast:episode>
      <itunes:title>MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ee1ef2cc-2b66-4bed-9280-25170ff33892</guid>
      <link>https://share.transistor.fm/s/dae0e210</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue</p>

            <p><strong>Title:</strong><br>
            MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07608v1">http://arxiv.org/abs/2505.07608v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue</p>

            <p><strong>Title:</strong><br>
            MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07608v1">http://arxiv.org/abs/2505.07608v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:34:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dae0e210/7ee4c441.mp3" length="21036835" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1311</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue</p>

            <p><strong>Title:</strong><br>
            MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07608v1">http://arxiv.org/abs/2505.07608v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets</title>
      <itunes:episode>752</itunes:episode>
      <podcast:episode>752</podcast:episode>
      <itunes:title>Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f7cb63e8-0afb-4406-ad24-2fc07d9ea5be</guid>
      <link>https://share.transistor.fm/s/0c9f9e0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan</p>

            <p><strong>Title:</strong><br>
            Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07747v1">http://arxiv.org/abs/2505.07747v1</a></p>

            <p><strong>Abstract:</strong><br>
            While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing &gt;5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan</p>

            <p><strong>Title:</strong><br>
            Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07747v1">http://arxiv.org/abs/2505.07747v1</a></p>

            <p><strong>Abstract:</strong><br>
            While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing &gt;5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:34:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c9f9e0d/67c40094.mp3" length="21309333" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, Xiao Chen, Feipeng Tian, Jianxiong Pan, Zeming Li, Gang Yu, Xiangyu Zhang, Daxin Jiang, Ping Tan</p>

            <p><strong>Title:</strong><br>
            Step1X-3D: Towards High-Fidelity and Controllable Generation of Textured 3D Assets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07747v1">http://arxiv.org/abs/2505.07747v1</a></p>

            <p><strong>Abstract:</strong><br>
            While generative artificial intelligence has advanced significantly across text, image, audio, and video domains, 3D generation remains comparatively underdeveloped due to fundamental challenges such as data scarcity, algorithmic limitations, and ecosystem fragmentation. To this end, we present Step1X-3D, an open framework addressing these challenges through: (1) a rigorous data curation pipeline processing &gt;5M assets to create a 2M high-quality dataset with standardized geometric and textural properties; (2) a two-stage 3D-native architecture combining a hybrid VAE-DiT geometry generator with a diffusion-based texture synthesis module; and (3) the full open-source release of models, training code, and adaptation modules. For geometry generation, the hybrid VAE-DiT component produces TSDF representations by employing perceiver-based latent encoding with sharp edge sampling for detail preservation. The diffusion-based texture synthesis module then ensures cross-view consistency through geometric conditioning and latent-space synchronization. Benchmark results demonstrate state-of-the-art performance that exceeds existing open-source methods, while also achieving competitive quality with proprietary solutions. Notably, the framework uniquely bridges the 2D and 3D generation paradigms by supporting direct transfer of 2D control techniques (e.g., LoRA) to 3D synthesis. By simultaneously advancing data quality, algorithmic fidelity, and reproducibility, Step1X-3D aims to establish new standards for open research in controllable 3D asset generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning from Peers in Reasoning Models</title>
      <itunes:episode>751</itunes:episode>
      <podcast:episode>751</podcast:episode>
      <itunes:title>Learning from Peers in Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">84f23e9a-5151-4a5f-b141-61a7787eb0d0</guid>
      <link>https://share.transistor.fm/s/7e8c05a2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Learning from Peers in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07787v1">http://arxiv.org/abs/2505.07787v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction through timely peer insights, showing strong error tolerance and handling of varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Learning from Peers in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07787v1">http://arxiv.org/abs/2505.07787v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction through timely peer insights, showing strong error tolerance and handling of varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:33:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e8c05a2/db29a39e.mp3" length="20392706" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1271</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tongxu Luo, Wenyu Du, Jiaxi Bi, Stephen Chung, Zhengyang Tang, Hao Yang, Min Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Learning from Peers in Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07787v1">http://arxiv.org/abs/2505.07787v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, at fixed token intervals, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our **LeaP-T** model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP scores nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction through timely peer insights, showing strong error tolerance and handling of varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unified Continuous Generative Models</title>
      <itunes:episode>750</itunes:episode>
      <podcast:episode>750</podcast:episode>
      <itunes:title>Unified Continuous Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2064b6fb-63bd-4920-90d9-671d77f61df0</guid>
      <link>https://share.transistor.fm/s/36bac515</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peng Sun, Yi Jiang, Tao Lin</p>

            <p><strong>Title:</strong><br>
            Unified Continuous Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07447v1">http://arxiv.org/abs/2505.07447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peng Sun, Yi Jiang, Tao Lin</p>

            <p><strong>Title:</strong><br>
            Unified Continuous Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07447v1">http://arxiv.org/abs/2505.07447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:33:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/36bac515/291e3652.mp3" length="17493320" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1090</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peng Sun, Yi Jiang, Tao Lin</p>

            <p><strong>Title:</strong><br>
            Unified Continuous Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.07447v1">http://arxiv.org/abs/2505.07447v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in continuous generative models, including multi-step approaches like diffusion and flow-matching (typically requiring 8-1000 sampling steps) and few-step methods such as consistency models (typically 1-8 steps), have demonstrated impressive generative performance. However, existing work often treats these approaches as distinct paradigms, resulting in separate training and sampling methodologies. We introduce a unified framework for training, sampling, and analyzing these models. Our implementation, the Unified Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a 675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID in 20 steps and a few-step model reaching 1.42 FID in just 2 steps. Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at 250 steps) improves performance to 1.06 FID in only 40 steps. Code is available at: https://github.com/LINs-lab/UCGM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback</title>
      <itunes:episode>749</itunes:episode>
      <podcast:episode>749</podcast:episode>
      <itunes:title>REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d96635d0-01d8-4581-82a9-8a14749ddb3d</guid>
      <link>https://share.transistor.fm/s/bbe5041a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.06548v1">http://arxiv.org/abs/2505.06548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large, API-only models such as GPT-3.5 (175B), which are expensive and subject to limits on the number of queries. This paper explores the performance of three small open-source LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing the human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLM-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve substantial improvements in 63-66% of the tasks compared to previous approaches.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.06548v1">http://arxiv.org/abs/2505.06548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large, API-only models such as GPT-3.5 (175B), which are expensive and subject to limits on the number of queries. This paper explores the performance of three small open-source LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing the human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLM-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve substantial improvements in 63-66% of the tasks compared to previous approaches.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 13 May 2025 20:33:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bbe5041a/4a5be95c.mp3" length="19944342" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.06548v1">http://arxiv.org/abs/2505.06548v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large, API-only models such as GPT-3.5 (175B), which are expensive and subject to limits on the number of queries. This paper explores the performance of three small open-source LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing the human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLM-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve substantial improvements in 63-66% of the tasks compared to previous approaches.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bielik v3 Small: Technical Report</title>
      <itunes:episode>748</itunes:episode>
      <podcast:episode>748</podcast:episode>
      <itunes:title>Bielik v3 Small: Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">014bdce0-fdeb-4746-9d32-85bc4ad83046</guid>
      <link>https://share.transistor.fm/s/71b5030b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej</p>

            <p><strong>Title:</strong><br>
            Bielik v3 Small: Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02550v2">http://arxiv.org/abs/2505.02550v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej</p>

            <p><strong>Title:</strong><br>
            Bielik v3 Small: Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02550v2">http://arxiv.org/abs/2505.02550v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 May 2025 20:07:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/71b5030b/8753ba63.mp3" length="23777333" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1482</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.LG, cs.AI, cs.CL, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej</p>

            <p><strong>Title:</strong><br>
            Bielik v3 Small: Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02550v2">http://arxiv.org/abs/2505.02550v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bielik 11B v2 Technical Report</title>
      <itunes:episode>747</itunes:episode>
      <podcast:episode>747</podcast:episode>
      <itunes:title>Bielik 11B v2 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1f51a486-49e8-41e0-a21d-201f446e94e3</guid>
      <link>https://share.transistor.fm/s/3f9bba56</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas</p>

            <p><strong>Title:</strong><br>
            Bielik 11B v2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02410v2">http://arxiv.org/abs/2505.02410v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas</p>

            <p><strong>Title:</strong><br>
            Bielik 11B v2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02410v2">http://arxiv.org/abs/2505.02410v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 12 May 2025 20:07:25 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3f9bba56/e5b25032.mp3" length="22652184" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej, Remigiusz Kinas</p>

            <p><strong>Title:</strong><br>
            Bielik 11B v2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02410v2">http://arxiv.org/abs/2505.02410v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Bielik 11B v2, a state-of-the-art language model optimized for Polish text processing. Built on the Mistral 7B v0.2 architecture and scaled to 11B parameters using depth up-scaling, this model demonstrates exceptional performance across Polish language benchmarks while maintaining strong cross-lingual capabilities. We introduce two key technical innovations: Weighted Instruction Cross-Entropy Loss, which optimizes learning across diverse instruction types by assigning quality-based weights to training examples, and Adaptive Learning Rate, which dynamically adjusts based on context length. Comprehensive evaluation across multiple benchmarks demonstrates that Bielik 11B v2 outperforms many larger models, including those with 2-6 times more parameters, and significantly surpasses other specialized Polish language models on tasks ranging from linguistic understanding to complex reasoning. The model's parameter efficiency and extensive quantization options enable deployment across various hardware configurations, advancing Polish language AI capabilities and establishing new benchmarks for resource-efficient language modeling in less-represented languages.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models</title>
      <itunes:episode>746</itunes:episode>
      <podcast:episode>746</podcast:episode>
      <itunes:title>Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5d5137c3-581b-4d2d-b180-281333e2dc94</guid>
      <link>https://share.transistor.fm/s/e12dd3be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04921v1">http://arxiv.org/abs/2505.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04921v1">http://arxiv.org/abs/2505.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 May 2025 20:02:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e12dd3be/7f9621f1.mp3" length="22457467" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1400</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 79 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04921v1">http://arxiv.org/abs/2505.04921v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>On Path to Multimodal Generalist: General-Level and General-Bench</title>
      <itunes:episode>745</itunes:episode>
      <podcast:episode>745</podcast:episode>
      <itunes:title>On Path to Multimodal Generalist: General-Level and General-Bench</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">72566839-cf91-4f01-bea1-aefa8679a1f4</guid>
      <link>https://share.transistor.fm/s/e992f0e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang</p>

            <p><strong>Title:</strong><br>
            On Path to Multimodal Generalist: General-Level and General-Bench</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04620v1">http://arxiv.org/abs/2505.04620v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results, which involve over 100 existing state-of-the-art MLLMs, uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang</p>

            <p><strong>Title:</strong><br>
            On Path to Multimodal Generalist: General-Level and General-Bench</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04620v1">http://arxiv.org/abs/2505.04620v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results, which involve over 100 existing state-of-the-art MLLMs, uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 May 2025 20:02:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e992f0e4/2a516121.mp3" length="20022002" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Weiming Wu, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang</p>

            <p><strong>Title:</strong><br>
            On Path to Multimodal Generalist: General-Level and General-Bench</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04620v1">http://arxiv.org/abs/2505.04620v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Multimodal Large Language Model (MLLM) is currently experiencing rapid growth, driven by the advanced capabilities of LLMs. Unlike earlier specialists, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding and from supporting limited modalities to arbitrary ones. While many benchmarks exist to assess MLLMs, a critical question arises: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. This project introduces General-Level, an evaluation framework that defines 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI. At the core of the framework is the concept of Synergy, which measures whether models maintain consistent capabilities across comprehension and generation, and across multiple modalities. To support this evaluation, we present General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results, which involve over 100 existing state-of-the-art MLLMs, uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project page: https://generalist.top/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Flow-GRPO: Training Flow Matching Models via Online RL</title>
      <itunes:episode>744</itunes:episode>
      <podcast:episode>744</podcast:episode>
      <itunes:title>Flow-GRPO: Training Flow Matching Models via Online RL</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1b034ec9-3d21-4a8f-932a-8baf3b0a0afa</guid>
      <link>https://share.transistor.fm/s/6205b0dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Flow-GRPO: Training Flow Matching Models via Online RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.05470v1">http://arxiv.org/abs/2505.05470v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Flow-GRPO: Training Flow Matching Models via Online RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.05470v1">http://arxiv.org/abs/2505.05470v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 09 May 2025 20:02:02 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6205b0dc/8aa04824.mp3" length="22942690" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Flow-GRPO: Training Flow Matching Models via Online RL</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.05470v1">http://arxiv.org/abs/2505.05470v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</title>
      <itunes:episode>743</itunes:episode>
      <podcast:episode>743</podcast:episode>
      <itunes:title>Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56686396-d1ab-442d-ac7d-dacc239d8c36</guid>
      <link>https://share.transistor.fm/s/6867f40b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02567v2">http://arxiv.org/abs/2505.02567v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02567v2">http://arxiv.org/abs/2505.02567v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 May 2025 19:57:49 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6867f40b/dd96e739.mp3" length="17655965" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1100</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinjie Zhang, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02567v2">http://arxiv.org/abs/2505.02567v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent years have seen remarkable progress in both multimodal understanding models and image generation models. Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation. Recently, there has been growing interest in developing unified frameworks that integrate these tasks. The emergence of GPT-4o's new capabilities exemplifies this trend, highlighting the potential for unification. However, the architectural differences between the two domains pose significant challenges. To provide a clear overview of current efforts toward unification, we present a comprehensive survey aimed at guiding future research. First, we introduce the foundational concepts and recent advancements in multimodal understanding and text-to-image generation models. Next, we review existing unified models, categorizing them into three main architectural paradigms: diffusion-based, autoregressive-based, and hybrid approaches that fuse autoregressive and diffusion mechanisms. For each category, we analyze the structural designs and innovations introduced by related works. Additionally, we compile datasets and benchmarks tailored for unified models, offering resources for future exploration. Finally, we discuss the key challenges facing this nascent field, including tokenization strategy, cross-modal attention, and data. As this area is still in its early stages, we anticipate rapid advancements and will regularly update this survey. Our goal is to inspire further research and provide a valuable reference for the community. The references associated with this survey are available on GitHub (https://github.com/AIDC-AI/Awesome-Unified-Multimodal-Models).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ZeroSearch: Incentivize the Search Capability of LLMs without Searching</title>
      <itunes:episode>742</itunes:episode>
      <podcast:episode>742</podcast:episode>
      <itunes:title>ZeroSearch: Incentivize the Search Capability of LLMs without Searching</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e32aa36c-0aaa-4e6e-a39c-f50297a5307d</guid>
      <link>https://share.transistor.fm/s/3fb73d43</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang</p>

            <p><strong>Title:</strong><br>
            ZeroSearch: Incentivize the Search Capability of LLMs without Searching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04588v1">http://arxiv.org/abs/2505.04588v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang</p>

            <p><strong>Title:</strong><br>
            ZeroSearch: Incentivize the Search Capability of LLMs without Searching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04588v1">http://arxiv.org/abs/2505.04588v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 08 May 2025 19:57:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3fb73d43/16063577.mp3" length="21298455" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang</p>

            <p><strong>Title:</strong><br>
            ZeroSearch: Incentivize the Search Capability of LLMs without Searching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.04588v1">http://arxiv.org/abs/2505.04588v1</a></p>

            <p><strong>Abstract:</strong><br>
            Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a reinforcement learning framework that incentivizes the search capabilities of LLMs without interacting with real search engines. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both relevant and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning</title>
      <itunes:episode>741</itunes:episode>
      <podcast:episode>741</podcast:episode>
      <itunes:title>Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e82ea10d-7008-48e4-8f3f-a1bd6d03054b</guid>
      <link>https://share.transistor.fm/s/c5d064b4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03318v1">http://arxiv.org/abs/2505.03318v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03318v1">http://arxiv.org/abs/2505.03318v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 May 2025 20:22:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5d064b4/d7f61b3b.mp3" length="21328559" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 67 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03318v1">http://arxiv.org/abs/2505.03318v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Absolute Zero: Reinforced Self-play Reasoning with Zero Data</title>
      <itunes:episode>740</itunes:episode>
      <podcast:episode>740</podcast:episode>
      <itunes:title>Absolute Zero: Reinforced Self-play Reasoning with Zero Data</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b31e6a5b-23d8-464e-a161-8d62ea46e5bc</guid>
      <link>https://share.transistor.fm/s/b6baadcf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Absolute Zero: Reinforced Self-play Reasoning with Zero Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03335v2">http://arxiv.org/abs/2505.03335v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Absolute Zero: Reinforced Self-play Reasoning with Zero Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03335v2">http://arxiv.org/abs/2505.03335v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 May 2025 20:22:27 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b6baadcf/dec370ad.mp3" length="23713830" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1478</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Absolute Zero: Reinforced Self-play Reasoning with Zero Data</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03335v2">http://arxiv.org/abs/2505.03335v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale</title>
      <itunes:episode>739</itunes:episode>
      <podcast:episode>739</podcast:episode>
      <itunes:title>RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f189b41-e7db-4d49-9ce7-543de1fd242c</guid>
      <link>https://share.transistor.fm/s/0b73dd79</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah</p>

            <p><strong>Title:</strong><br>
            RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03005v1">http://arxiv.org/abs/2505.03005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah</p>

            <p><strong>Title:</strong><br>
            RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03005v1">http://arxiv.org/abs/2505.03005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 May 2025 20:22:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0b73dd79/590c5d7d.mp3" length="20142802" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah</p>

            <p><strong>Title:</strong><br>
            RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03005v1">http://arxiv.org/abs/2505.03005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios</title>
      <itunes:episode>738</itunes:episode>
      <podcast:episode>738</podcast:episode>
      <itunes:title>FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f19b608a-a888-4750-8bbf-b0a7fee654a1</guid>
      <link>https://share.transistor.fm/s/4cd05a2c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03730v1">http://arxiv.org/abs/2505.03730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, performs action extraction directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/</p>
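
            <p><strong>Code sketch:</strong><br>
            A loose illustration of the low/high-frequency split the abstract alludes to, not the paper's FAE module: it separates an image-like array into low- and high-frequency bands with an FFT mask. The cutoff radius and array sizes are arbitrary assumptions.</p>

            <pre><code>import numpy as np

def frequency_split(frame, cutoff=8):
    """Split a 2D array into (low_freq, high_freq) bands via an FFT mask."""
    F = np.fft.fftshift(np.fft.fft2(frame))
    h, w = frame.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    high_mask = dist > cutoff                     # everything outside the radius
    low = np.fft.ifft2(np.fft.ifftshift(F * (~high_mask))).real
    high = np.fft.ifft2(np.fft.ifftshift(F * high_mask)).real
    return low, high

rng = np.random.default_rng(0)
frame = rng.normal(size=(64, 64))                 # stand-in for one video frame
low, high = frequency_split(frame)
print(np.allclose(frame, low + high))             # the two bands sum back to the frame
</code></pre>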
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03730v1">http://arxiv.org/abs/2505.03730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, performs action extraction directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 07 May 2025 20:21:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4cd05a2c/6c2c1e1b.mp3" length="19240003" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1199</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shiyi Zhang, Junhao Zhuang, Zhaoyang Zhang, Ying Shan, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.03730v1">http://arxiv.org/abs/2505.03730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process attends to motion (low frequency) and appearance details (high frequency) to varying degrees at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, performs action extraction directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play</title>
      <itunes:episode>737</itunes:episode>
      <podcast:episode>737</podcast:episode>
      <itunes:title>Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">670a944d-ce49-436e-9ce8-20243dcf6591</guid>
      <link>https://share.transistor.fm/s/dcebc27c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02707v1">http://arxiv.org/abs/2505.02707v1</a></p>

            <p><strong>Abstract:</strong><br>
            A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02707v1">http://arxiv.org/abs/2505.02707v1</a></p>

            <p><strong>Abstract:</strong><br>
            A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 May 2025 20:36:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dcebc27c/becceb24.mp3" length="22466676" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1400</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 56 | cs.AI, cs.CL, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu</p>

            <p><strong>Title:</strong><br>
            Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02707v1">http://arxiv.org/abs/2505.02707v1</a></p>

            <p><strong>Abstract:</strong><br>
            A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation -- where users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RM-R1: Reward Modeling as Reasoning</title>
      <itunes:episode>736</itunes:episode>
      <podcast:episode>736</podcast:episode>
      <itunes:title>RM-R1: Reward Modeling as Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b4adc128-2353-4250-94a1-89bd17016272</guid>
      <link>https://share.transistor.fm/s/500020a5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji</p>

            <p><strong>Title:</strong><br>
            RM-R1: Reward Modeling as Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02387v1">http://arxiv.org/abs/2505.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances in long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.</p>
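
            <p><strong>Code sketch:</strong><br>
            A minimal sketch of the verifiable-reward idea as we read it, not the released RM-R1 code: the reward is an exact-match check between a judge's predicted preference and the human label. The judge callable is a hypothetical stand-in for an LLM that emits a critique and a final verdict.</p>

            <pre><code>from typing import Callable, Dict

def verifiable_reward(example: Dict[str, str],
                      judge: Callable[[str, str, str], str]) -> float:
    """Return 1.0 if the judge picks the human-preferred answer, else 0.0."""
    verdict = judge(example["prompt"], example["answer_a"], example["answer_b"])
    return 1.0 if verdict == example["label"] else 0.0

# Toy stand-in judge that simply prefers the longer answer.
toy_judge = lambda prompt, a, b: "A" if len(a) > len(b) else "B"

example = {
    "prompt": "What is 2 + 2?",
    "answer_a": "4",
    "answer_b": "The answer is 4, because 2 + 2 equals 4.",
    "label": "B",
}
print(verifiable_reward(example, toy_judge))      # 1.0 for this toy example
</code></pre>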
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji</p>

            <p><strong>Title:</strong><br>
            RM-R1: Reward Modeling as Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02387v1">http://arxiv.org/abs/2505.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances in long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 May 2025 20:36:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/500020a5/a1178155.mp3" length="22327853" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1392</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji</p>

            <p><strong>Title:</strong><br>
            RM-R1: Reward Modeling as Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02387v1">http://arxiv.org/abs/2505.02387v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly generate the prediction of a preferred answer, making them struggle to integrate natural language critiques, thus lacking interpretability. Inspired by recent advances in long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. In this work, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers</title>
      <itunes:episode>735</itunes:episode>
      <podcast:episode>735</podcast:episode>
      <itunes:title>Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d55b2e92-122a-4393-b103-4958970f4747</guid>
      <link>https://share.transistor.fm/s/58e00d4e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG, I.2.7; I.2.6; I.2.3; I.7</p>

            <p><strong>Authors:</strong><br>
            Roman Abramov, Felix Steinbauer, Gjergji Kasneci</p>

            <p><strong>Title:</strong><br>
            Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20752v1">http://arxiv.org/abs/2504.20752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.</p>
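
            <p><strong>Code sketch:</strong><br>
            A toy illustration of the ratio $\phi_r$ under our own assumptions (the paper's construction may differ): atomic facts are (head, relation, tail) triples and inferred facts are the 2-hop compositions they license; per the abstract, carefully designed synthetic data is added until this ratio clears the grokking threshold.</p>

            <pre><code>from itertools import product

# Atomic facts: (head, relation, tail) triples in a tiny knowledge graph.
atomic = {
    ("alice", "mother_of", "bob"),
    ("bob", "father_of", "carol"),
    ("dana", "mother_of", "erin"),
    ("erin", "father_of", "frank"),
}

def two_hop_inferences(facts):
    """Facts implied by composing two atomic facts that share an entity."""
    inferred = set()
    for (h1, r1, t1), (h2, r2, t2) in product(facts, facts):
        if t1 == h2 and (h1, r1, t1) != (h2, r2, t2):
            inferred.add((h1, r1 + "+" + r2, t2))
    return inferred

inferred = two_hop_inferences(atomic)
phi_r = len(inferred) / len(atomic)
print(len(atomic), len(inferred), round(phi_r, 2))   # 4 atomic, 2 inferred, 0.5
</code></pre>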
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG, I.2.7; I.2.6; I.2.3; I.7</p>

            <p><strong>Authors:</strong><br>
            Roman Abramov, Felix Steinbauer, Gjergji Kasneci</p>

            <p><strong>Title:</strong><br>
            Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20752v1">http://arxiv.org/abs/2504.20752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 May 2025 20:35:50 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/58e00d4e/5239e3d2.mp3" length="20561614" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1281</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.LG, I.2.7; I.2.6; I.2.3; I.7</p>

            <p><strong>Authors:</strong><br>
            Roman Abramov, Felix Steinbauer, Gjergji Kasneci</p>

            <p><strong>Title:</strong><br>
            Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20752v1">http://arxiv.org/abs/2504.20752v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns - yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95-100% accuracy on 2WikiMultiHopQA - substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Practical Efficiency of Muon for Pretraining</title>
      <itunes:episode>734</itunes:episode>
      <podcast:episode>734</podcast:episode>
      <itunes:title>Practical Efficiency of Muon for Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ffd48a3-8c1d-4633-af7a-66b6d4beb730</guid>
      <link>https://share.transistor.fm/s/876d9842</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani</p>

            <p><strong>Title:</strong><br>
            Practical Efficiency of Muon for Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02222v1">http://arxiv.org/abs/2505.02222v1</a></p>

            <p><strong>Abstract:</strong><br>
            We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.</p>
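
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of a Muon-style update, not the authors' implementation: a per-matrix momentum buffer is approximately orthogonalized before being applied. A plain cubic Newton-Schulz iteration stands in for the faster polynomial used in public Muon code, and the learning rate, momentum coefficient, and step counts are illustrative assumptions.</p>

            <pre><code>import numpy as np

def newton_schulz_orthogonalize(M, steps=10):
    # Normalize so singular values are at most 1, then iterate
    # X = 1.5 X - 0.5 X X^T X, which drives X toward its orthogonal polar factor.
    X = M / (np.linalg.norm(M) + 1e-8)            # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

def muon_style_step(weight, grad, momentum, lr=0.02, beta=0.95):
    momentum = beta * momentum + grad             # momentum buffer on raw gradients
    update = newton_schulz_orthogonalize(momentum)
    return weight - lr * update, momentum

rng = np.random.default_rng(0)
W, buf = rng.normal(size=(32, 64)), np.zeros((32, 64))
for _ in range(3):                                # a few fake optimization steps
    W, buf = muon_style_step(W, rng.normal(size=(32, 64)) * 0.1, buf)
# Leading singular values of the orthogonalized update are pushed toward 1.
print(np.round(np.linalg.svd(newton_schulz_orthogonalize(buf), compute_uv=False)[:4], 2))
</code></pre>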
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani</p>

            <p><strong>Title:</strong><br>
            Practical Efficiency of Muon for Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02222v1">http://arxiv.org/abs/2505.02222v1</a></p>

            <p><strong>Abstract:</strong><br>
            We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 06 May 2025 20:35:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/876d9842/9beabddf.mp3" length="22510092" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani</p>

            <p><strong>Title:</strong><br>
            Practical Efficiency of Muon for Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.02222v1">http://arxiv.org/abs/2505.02222v1</a></p>

            <p><strong>Abstract:</strong><br>
            We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PixelHacker: Image Inpainting with Structural and Semantic Consistency</title>
      <itunes:episode>733</itunes:episode>
      <podcast:episode>733</podcast:episode>
      <itunes:title>PixelHacker: Image Inpainting with Structural and Semantic Consistency</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9ff7be8c-a73f-4bfc-a856-0df8992ad711</guid>
      <link>https://share.transistor.fm/s/8b2d3c84</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            PixelHacker: Image Inpainting with Structural and Semantic Consistency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20438v2">http://arxiv.org/abs/2504.20438v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (with 116 and 21 potential categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.</p>
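
            <p><strong>Code sketch:</strong><br>
            A minimal sketch under our own assumptions, not the PixelHacker architecture: it shows one way fixed-size foreground/background category embeddings could be injected into denoiser features with linear (kernelized) cross-attention. Dimensions, the feature map, and the residual add are illustrative choices; only the 116/21 category counts come from the abstract.</p>

            <pre><code>import numpy as np

def linear_cross_attention(latent, categories):
    # latent: (N, d) denoiser tokens; categories: (M, d) fixed-size embeddings.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6     # positive feature map
    Q, K, V = phi(latent), phi(categories), categories
    kv = K.T @ V                                  # (d, d) summary; no N x M score matrix
    z = K.sum(axis=0)                             # (d,) normalizer
    return (Q @ kv) / (Q @ z + 1e-6)[:, None]

rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 32))               # e.g. flattened latent patches
fg_emb = rng.normal(size=(116, 32))               # foreground category embeddings
bg_emb = rng.normal(size=(21, 32))                # background category embeddings
out = latent + linear_cross_attention(latent, np.concatenate([fg_emb, bg_emb]))
print(out.shape)                                  # (256, 32)
</code></pre>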
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            PixelHacker: Image Inpainting with Structural and Semantic Consistency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20438v2">http://arxiv.org/abs/2504.20438v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (with 116 and 21 potential categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 05 May 2025 19:47:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8b2d3c84/ca40d493.mp3" length="17691466" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1102</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyang Xu, Kangsheng Duan, Xiaolei Shen, Zhifeng Ding, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            PixelHacker: Image Inpainting with Structural and Semantic Consistency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20438v2">http://arxiv.org/abs/2504.20438v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (with 116 and 21 potential categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at https://hustvl.github.io/PixelHacker.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Survey of Interactive Generative Video</title>
      <itunes:episode>732</itunes:episode>
      <podcast:episode>732</podcast:episode>
      <itunes:title>A Survey of Interactive Generative Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56caf7fd-ddb0-4fe0-a8e4-068e8dc8e3d9</guid>
      <link>https://share.transistor.fm/s/d791dfa4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Interactive Generative Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21853v1">http://arxiv.org/abs/2504.21853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Interactive Generative Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21853v1">http://arxiv.org/abs/2504.21853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 02 May 2025 20:00:06 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d791dfa4/d81ce799.mp3" length="21947933" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Haoxuan Che, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Hao Chen, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            A Survey of Interactive Generative Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21853v1">http://arxiv.org/abs/2504.21853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Interactive Generative Video (IGV) has emerged as a crucial technology in response to the growing demand for high-quality, interactive video content across various domains. In this paper, we define IGV as a technology that combines generative capabilities to produce diverse high-quality video content with interactive features that enable user engagement through control signals and responsive feedback. We survey the current landscape of IGV applications, focusing on three major domains: 1) gaming, where IGV enables infinite exploration in virtual worlds; 2) embodied AI, where IGV serves as a physics-aware environment synthesizer for training agents in multimodal interaction with dynamically evolving scenes; and 3) autonomous driving, where IGV provides closed-loop simulation capabilities for safety-critical testing and validation. To guide future development, we propose a comprehensive framework that decomposes an ideal IGV system into five essential modules: Generation, Control, Memory, Dynamics, and Intelligence. Furthermore, we systematically analyze the technical challenges and future directions in realizing each component for an ideal IGV system, such as achieving real-time generation, enabling open-domain control, maintaining long-term coherence, simulating accurate physics, and integrating causal reasoning. We believe that this systematic analysis will facilitate future research and development in the field of IGV, ultimately advancing the technology toward more sophisticated and practical applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepCritic: Deliberate Critique with Large Language Models</title>
      <itunes:episode>731</itunes:episode>
      <podcast:episode>731</podcast:episode>
      <itunes:title>DeepCritic: Deliberate Critique with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">02d162e9-4936-4ae8-98b9-e8ba122d9f2d</guid>
      <link>https://share.transistor.fm/s/9ab1f456</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            DeepCritic: Deliberate Critique with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.00662v1">http://arxiv.org/abs/2505.00662v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.</p>
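
            <p><strong>Code sketch:</strong><br>
            A small sketch of Monte Carlo sampling-based correctness estimation as it is commonly done, not the paper's released code: each reasoning step is labeled by the fraction of rollouts from its prefix that reach the known final answer. The sample_completion callable is a hypothetical stand-in for an LLM call, and the sample count and threshold are assumptions.</p>

            <pre><code>import random
from typing import Callable, List

def mc_step_labels(steps: List[str], gold_answer: str,
                   sample_completion: Callable[[str], str],
                   n_samples: int = 8, threshold: float = 0.5) -> List[int]:
    """Label each step 1/0 by the fraction of rollouts that reach the gold answer."""
    labels, prefix = [], ""
    for step in steps:
        prefix = prefix + step + "\n"
        hits = sum(1 for _ in range(n_samples)
                   if sample_completion(prefix).strip() == gold_answer)
        labels.append(1 if hits / n_samples >= threshold else 0)
    return labels

# Toy stand-in sampler: rollouts whose prefix contains the flawed step usually end wrong.
def toy_sampler(prefix: str) -> str:
    if "2 + 2 = 5" in prefix:
        return "5" if random.random() > 0.1 else "4"
    return "4"

random.seed(0)
steps = ["We need 2 + 2.", "2 + 2 = 5", "So the answer is 5."]
print(mc_step_labels(steps, "4", toy_sampler))    # typically [1, 0, 0]
</code></pre>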
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            DeepCritic: Deliberate Critique with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.00662v1">http://arxiv.org/abs/2505.00662v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 02 May 2025 19:59:44 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ab1f456/56b67d37.mp3" length="21230733" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1323</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            DeepCritic: Deliberate Critique with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2505.00662v1">http://arxiv.org/abs/2505.00662v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial for each step, leading to low judgment accuracy and failing to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sadeed: Advancing Arabic Diacritization Through Small Language Model</title>
      <itunes:episode>730</itunes:episode>
      <podcast:episode>730</podcast:episode>
      <itunes:title>Sadeed: Advancing Arabic Diacritization Through Small Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">36062bd4-1ccf-42ae-a533-d82a81e2cef2</guid>
      <link>https://share.transistor.fm/s/4ec67056</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Sadeed: Advancing Arabic Diacritization Through Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21635v1">http://arxiv.org/abs/2504.21635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.</p>
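
            <p>One ingredient such a pipeline needs is turning gold diacritized text into (input, target) pairs; the minimal sketch below strips the common Arabic diacritic marks to form the undiacritized input. The character set and pairing format are illustrative assumptions, not Sadeed's actual preprocessing code.</p>

            <pre><code># Arabic combining marks commonly treated as diacritics: tanwin, harakat, shadda, sukun.
DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def strip_diacritics(text):
    """Remove diacritic marks, leaving the bare letter skeleton."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

def make_pair(diacritized_sentence):
    """Form one (input, target) pair for training a diacritization model."""
    return {"input": strip_diacritics(diacritized_sentence),
            "target": diacritized_sentence}

print(make_pair("كَتَبَ"))   # fully diacritized 'kataba' paired with its bare form
</code></pre>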
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Sadeed: Advancing Arabic Diacritization Through Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21635v1">http://arxiv.org/abs/2504.21635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 May 2025 20:26:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4ec67056/1528f705.mp3" length="21321858" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Sadeed: Advancing Arabic Diacritization Through Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21635v1">http://arxiv.org/abs/2504.21635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B (Hennara et al., 2025), a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebThinker: Empowering Large Reasoning Models with Deep Research Capability</title>
      <itunes:episode>729</itunes:episode>
      <podcast:episode>729</podcast:episode>
      <itunes:title>WebThinker: Empowering Large Reasoning Models with Deep Research Capability</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e447919-9e67-42f6-a09d-bcc5484d4e5c</guid>
      <link>https://share.transistor.fm/s/09def1ae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            WebThinker: Empowering Large Reasoning Models with Deep Research Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21776v1">http://arxiv.org/abs/2504.21776v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose <strong>WebThinker</strong>, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a <strong>Deep Web Explorer</strong> module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an <strong>Autonomous Think-Search-and-Draft strategy</strong>, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an <strong>RL-based training strategy</strong> via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.</p>
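
            <p>The interleaved think-search-and-draft control flow can be pictured with the toy loop below; the policy, search tool, and scripted actions are invented placeholders for illustration, not WebThinker's real interfaces.</p>

            <pre><code>def run_deep_research(question, policy, search, max_turns=10):
    """Toy think-search-and-draft loop: at each turn the reasoning model decides
    whether to keep thinking, query the web, draft part of the report, or stop."""
    context, report = ["Question: " + question], []
    for _ in range(max_turns):
        action, payload = policy(context)
        if action == "search":
            context.append("[search results] " + search(payload))
        elif action == "draft":
            report.append(payload)
        elif action == "stop":
            break
        else:                                   # a plain reasoning step
            context.append(payload)
    return "\n".join(report)

# Scripted stand-ins so the sketch runs without a model or a search engine.
def make_toy_policy():
    script = iter([("search", "background on the question"),
                   ("draft", "Section 1: what the search results say."),
                   ("stop", "")])
    return lambda context: next(script)

print(run_deep_research("What is deep research?", make_toy_policy(),
                        lambda q: "snippets for: " + q))
</code></pre>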
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            WebThinker: Empowering Large Reasoning Models with Deep Research Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21776v1">http://arxiv.org/abs/2504.21776v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose <strong>WebThinker</strong>, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a <strong>Deep Web Explorer</strong> module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an <strong>Autonomous Think-Search-and-Draft strategy</strong>, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an <strong>RL-based training strategy</strong> via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 May 2025 20:26:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/09def1ae/10901630.mp3" length="20406952" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            WebThinker: Empowering Large Reasoning Models with Deep Research Capability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21776v1">http://arxiv.org/abs/2504.21776v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose <strong>WebThinker</strong>, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a <strong>Deep Web Explorer</strong> module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an <strong>Autonomous Think-Search-and-Draft strategy</strong>, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an <strong>RL-based training strategy</strong> via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math</title>
      <itunes:episode>728</itunes:episode>
      <podcast:episode>728</podcast:episode>
      <itunes:title>Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">04a04a1d-1344-4cbf-862d-0ccffaabf370</guid>
      <link>https://share.transistor.fm/s/7d7812c0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21233v1">http://arxiv.org/abs/2504.21233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model surpasses much larger reasoning models on math reasoning tasks, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective in unlocking strong reasoning capabilities even in resource-constrained small models.</p>
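
            <p>Step (4) depends on a reward that can be checked programmatically; the sketch below shows one simple form such a verifiable math reward could take, with an answer-extraction rule that is our assumption rather than the paper's verifier.</p>

            <pre><code>import re

def verifiable_reward(model_output, gold_answer):
    """Return 1.0 if the final answer matches the reference, else 0.0.
    The extraction rule (last \\boxed{...} or last number) is a simplified
    stand-in for a real math verifier."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if boxed:
        prediction = boxed[-1]
    else:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
        prediction = numbers[-1] if numbers else ""
    return 1.0 if prediction.strip() == str(gold_answer).strip() else 0.0

print(verifiable_reward("... so the result is \\boxed{12}", 12))   # 1.0
print(verifiable_reward("the answer is 7", 12))                    # 0.0
</code></pre>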
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21233v1">http://arxiv.org/abs/2504.21233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model surpasses much larger reasoning models on math reasoning tasks, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective in unlocking strong reasoning capabilities even in resource-constrained small models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 May 2025 20:26:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d7812c0/491eedf2.mp3" length="19253395" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1200</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoran Xu, Baolin Peng, Hany Awadalla, Dongdong Chen, Yen-Chun Chen, Mei Gao, Young Jin Kim, Yunsheng Li, Liliang Ren, Yelong Shen, Shuohang Wang, Weijian Xu, Jianfeng Gao, Weizhu Chen</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21233v1">http://arxiv.org/abs/2504.21233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs) by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work on DeepSeek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLMs. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method to Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model surpasses much larger reasoning models on math reasoning tasks, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective in unlocking strong reasoning capabilities even in resource-constrained small models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning</title>
      <itunes:episode>727</itunes:episode>
      <podcast:episode>727</podcast:episode>
      <itunes:title>COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5bab98a1-0382-439c-aa91-30d99947cf22</guid>
      <link>https://share.transistor.fm/s/935a0d11</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky</p>

            <p><strong>Title:</strong><br>
            COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21850v1">http://arxiv.org/abs/2504.21850v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several of them, especially those involving complex multi-capability tasks. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.</p>
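
            <p>The core data recipe, generating questions that combine k atomic capabilities, can be sketched as below; the capability list and values of k are illustrative assumptions, not COMPACT's actual taxonomy.</p>

            <pre><code>from itertools import combinations

ATOMIC_CAPABILITIES = ["recognize objects", "count objects", "spatial relations",
                       "read text", "compare attributes"]   # illustrative set

def capability_combinations(max_k):
    """Yield capability subsets grouped by compositional complexity k,
    i.e. how many atomic capabilities a generated question must combine."""
    for k in range(1, max_k + 1):
        for combo in combinations(ATOMIC_CAPABILITIES, k):
            yield k, combo

for k, combo in capability_combinations(3):
    print("k=" + str(k) + ": generate a VQA example requiring " + " + ".join(combo))
</code></pre>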
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky</p>

            <p><strong>Title:</strong><br>
            COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21850v1">http://arxiv.org/abs/2504.21850v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several of them, especially those involving complex multi-capability tasks. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 01 May 2025 20:25:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/935a0d11/4e818829.mp3" length="17929280" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1117</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky</p>

            <p><strong>Title:</strong><br>
            COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.21850v1">http://arxiv.org/abs/2504.21850v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several of them, especially those involving complex multi-capability tasks. For example, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reinforcement Learning for Reasoning in Large Language Models with One Training Example</title>
      <itunes:episode>726</itunes:episode>
      <podcast:episode>726</podcast:episode>
      <itunes:title>Reinforcement Learning for Reasoning in Large Language Models with One Training Example</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9692a6b2-14b1-4972-b066-66ad884f4959</guid>
      <link>https://share.transistor.fm/s/8989dd20</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning for Reasoning in Large Language Models with One Training Example</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20571v1">http://arxiv.org/abs/2504.20571v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR</p>
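
            <p>Conceptually, the training signal combines a reward-weighted policy gradient term with an entropy bonus; the sketch below is a simplified REINFORCE-style stand-in (not GRPO or PPO) with made-up token probabilities, just to show how the two terms interact.</p>

            <pre><code>import math

def rlvr_loss(token_logprobs, token_dists, reward, entropy_coef=0.01):
    """Simplified REINFORCE-style RLVR objective: reward-weighted log-likelihood
    of the sampled tokens plus an entropy bonus, returned as a loss to minimize."""
    pg_term = reward * sum(token_logprobs)                    # policy gradient part
    entropy = 0.0
    for dist in token_dists:                                  # per-token output distributions
        entropy += -sum(p * math.log(p) for p in dist if p)
    return -(pg_term + entropy_coef * entropy)

# Toy example: three generated tokens and a verifiable reward of 1.0 (answer correct).
logprobs = [math.log(0.6), math.log(0.5), math.log(0.9)]
dists = [[0.6, 0.4], [0.5, 0.5], [0.9, 0.1]]
print(rlvr_loss(logprobs, dists, reward=1.0))
</code></pre>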
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning for Reasoning in Large Language Models with One Training Example</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20571v1">http://arxiv.org/abs/2504.20571v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Apr 2025 20:28:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8989dd20/2dfc471e.mp3" length="21591878" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1346</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen</p>

            <p><strong>Title:</strong><br>
            Reinforcement Learning for Reasoning in Large Language Models with One Training Example</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20571v1">http://arxiv.org/abs/2504.20571v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (many of which yield approximately 30% or greater improvement on MATH500 when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by adding entropy loss with an appropriate coefficient) in 1-shot RLVR training. As a bonus, we observe that applying entropy loss alone, without any outcome reward, significantly enhances Qwen2.5-Math-1.5B's performance on MATH500 by 27.4%. These findings can inspire future work on RLVR data efficiency and encourage a re-examination of both recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities</title>
      <itunes:episode>725</itunes:episode>
      <podcast:episode>725</podcast:episode>
      <itunes:title>UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">80aea394-55ca-4d15-8888-9b6458659f12</guid>
      <link>https://share.transistor.fm/s/583b8a83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.CV, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20734v1">http://arxiv.org/abs/2504.20734v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.</p>
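
            <p>The routing idea can be pictured as a two-stage dispatch: choose a modality-specific corpus at some granularity, then retrieve within it. The keyword rules and toy corpora below merely stand in for the learned router described in the paper.</p>

            <pre><code>CORPORA = {   # toy modality- and granularity-specific corpora
    ("text", "paragraph"): ["RAG grounds answers in retrieved text passages."],
    ("image", "image"): ["photo_of_eiffel_tower.jpg"],
    ("video", "clip"): ["cooking_tutorial_clip_03.mp4"],
}

def route(query):
    """Keyword stand-in for the learned modality-aware router: map a query
    to a (modality, granularity) pair that names one corpus."""
    q = query.lower()
    if "video" in q or "how do i" in q:
        return ("video", "clip")
    if "photo" in q or "look like" in q:
        return ("image", "image")
    return ("text", "paragraph")

def universal_retrieve(query):
    key = route(query)
    return key, CORPORA[key][0]   # toy retrieval: first item in the routed corpus

print(universal_retrieve("What does the Eiffel Tower look like?"))
print(universal_retrieve("Explain retrieval-augmented generation."))
</code></pre>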
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.CV, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20734v1">http://arxiv.org/abs/2504.20734v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Apr 2025 20:28:16 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/583b8a83/99a4645b.mp3" length="21112918" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.AI, cs.CV, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20734v1">http://arxiv.org/abs/2504.20734v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReasonIR: Training Retrievers for Reasoning Tasks</title>
      <itunes:episode>724</itunes:episode>
      <podcast:episode>724</podcast:episode>
      <itunes:title>ReasonIR: Training Retrievers for Reasoning Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8fa9aaf-41f0-4fde-8764-b5d08b6585d9</guid>
      <link>https://share.transistor.fm/s/a73eae53</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            ReasonIR: Training Retrievers for Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20595v1">http://arxiv.org/abs/2504.20595v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.</p>
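
            <p>The per-document loop of that pipeline can be sketched as below, where the query and hard-negative generators are placeholders for LLM calls rather than ReasonIR's actual prompts.</p>

            <pre><code>def build_training_example(document, gen_query, gen_hard_negative):
    """For one document, synthesize a reasoning-intensive query plus a plausibly
    related but unhelpful hard negative; both generators stand in for LLM calls."""
    query = gen_query(document)
    hard_negative = gen_hard_negative(document, query)
    return {"query": query, "positive": document, "negative": hard_negative}

# Toy stand-ins so the sketch runs without any model calls.
doc = "Dynamic programming reuses the solutions of overlapping subproblems."
example = build_training_example(
    doc,
    gen_query=lambda d: "Which technique would speed up a naive recursive Fibonacci, and why?",
    gen_hard_negative=lambda d, q: "Greedy algorithms pick the locally best choice at each step.",
)
print(example["query"])
print(example["negative"])
</code></pre>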
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            ReasonIR: Training Retrievers for Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20595v1">http://arxiv.org/abs/2504.20595v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Apr 2025 20:27:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a73eae53/3093abbb.mp3" length="20577036" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            ReasonIR: Training Retrievers for Reasoning Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20595v1">http://arxiv.org/abs/2504.20595v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without reranker and 36.9 nDCG@10 with reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries; it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Leaderboard Illusion</title>
      <itunes:episode>723</itunes:episode>
      <podcast:episode>723</podcast:episode>
      <itunes:title>The Leaderboard Illusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aab280ab-2c36-4568-b2db-b34e5ed1e389</guid>
      <link>https://share.transistor.fm/s/307746d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.LG, stat.ME</p>

            <p><strong>Authors:</strong><br>
            Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker</p>

            <p><strong>Title:</strong><br>
            The Leaderboard Illusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20879v1">http://arxiv.org/abs/2504.20879v1</a></p>

            <p><strong>Abstract:</strong><br>
            Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.</p>
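
            <p>The selective-disclosure effect can be illustrated with a toy simulation: if several equally skilled private variants are tested and only the best score is published, the published score is biased upward, and the bias grows with the number of variants. The score scale, noise level, and trial counts below are arbitrary illustrative choices, not the paper's estimates.</p>

            <pre><code>import random

def published_score(n_private_variants, true_skill=1000.0, noise_sd=30.0):
    """Toy model of selective disclosure: privately test several equally skilled
    variants, then publish only the best observed Arena-style score."""
    observed = [random.gauss(true_skill, noise_sd) for _ in range(n_private_variants)]
    return max(observed)

random.seed(0)
trials = 2000
for n in (1, 5, 27):   # 27 echoes the number of private Llama-4 variants in the paper
    mean_best = sum(published_score(n) for _ in range(trials)) / trials
    print("variants tested privately:", n, " mean published score:", round(mean_best, 1))
</code></pre>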
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.LG, stat.ME</p>

            <p><strong>Authors:</strong><br>
            Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker</p>

            <p><strong>Title:</strong><br>
            The Leaderboard Illusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20879v1">http://arxiv.org/abs/2504.20879v1</a></p>

            <p><strong>Abstract:</strong><br>
            Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Apr 2025 20:27:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/307746d6/be5bc6b9.mp3" length="20116838" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI, cs.CL, cs.LG, stat.ME</p>

            <p><strong>Authors:</strong><br>
            Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D'Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker</p>

            <p><strong>Title:</strong><br>
            The Leaderboard Illusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20879v1">http://arxiv.org/abs/2504.20879v1</a></p>

            <p><strong>Abstract:</strong><br>
Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models</title>
      <itunes:episode>722</itunes:episode>
      <podcast:episode>722</podcast:episode>
      <itunes:title>Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8cd8102f-73d6-40c3-acc6-91ba0514e311</guid>
      <link>https://share.transistor.fm/s/cf912429</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20157v1">http://arxiv.org/abs/2504.20157v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20157v1">http://arxiv.org/abs/2504.20157v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 30 Apr 2025 20:27:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cf912429/22d948ba.mp3" length="20324201" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1267</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zae Myung Kim, Chanwoo Park, Vipul Raheja, Dongyeop Kang</p>

            <p><strong>Title:</strong><br>
            Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.20157v1">http://arxiv.org/abs/2504.20157v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, such as question answering and mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and models will be publicly shared.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RepText: Rendering Visual Text via Replicating</title>
      <itunes:episode>721</itunes:episode>
      <podcast:episode>721</podcast:episode>
      <itunes:title>RepText: Rendering Visual Text via Replicating</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">14833708-52a6-403d-9553-6c82bb40e3c2</guid>
      <link>https://share.transistor.fm/s/ca325b63</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen</p>

            <p><strong>Title:</strong><br>
            RepText: Rendering Visual Text via Replicating</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.19724v1">http://arxiv.org/abs/2504.19724v1</a></p>

            <p><strong>Abstract:</strong><br>
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand them. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process at the inference phase, we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region only, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves comparable results to native multilingual closed-source models. To be fair, we also exhaustively discuss its limitations at the end.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen</p>

            <p><strong>Title:</strong><br>
            RepText: Rendering Visual Text via Replicating</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.19724v1">http://arxiv.org/abs/2504.19724v1</a></p>

            <p><strong>Abstract:</strong><br>
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand them. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process at the inference phase, we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region only, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves comparable results to native multilingual closed-source models. To be fair, we also exhaustively discuss its limitations at the end.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 29 Apr 2025 19:50:29 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca325b63/b1afa32d.mp3" length="21136262" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen</p>

            <p><strong>Title:</strong><br>
            RepText: Rendering Visual Text via Replicating</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.19724v1">http://arxiv.org/abs/2504.19724v1</a></p>

            <p><strong>Abstract:</strong><br>
Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, not a necessary one. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to truly understand them. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process at the inference phase, we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict feature injection to the text region only, avoiding distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works; our approach outperforms existing open-source methods and achieves comparable results to native multilingual closed-source models. To be fair, we also exhaustively discuss its limitations at the end.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards Understanding Camera Motions in Any Video</title>
      <itunes:episode>720</itunes:episode>
      <podcast:episode>720</podcast:episode>
      <itunes:title>Towards Understanding Camera Motions in Any Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bc9d41db-7979-44d1-854f-3f119aff41cc</guid>
      <link>https://share.transistor.fm/s/78cd63fe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan</p>

            <p><strong>Title:</strong><br>
            Towards Understanding Camera Motions in Any Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15376v1">http://arxiv.org/abs/2504.15376v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan</p>

            <p><strong>Title:</strong><br>
            Towards Understanding Camera Motions in Any Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15376v1">http://arxiv.org/abs/2504.15376v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 28 Apr 2025 20:14:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78cd63fe/36bb37ae.mp3" length="20746727" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1293</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 127 | cs.CV, cs.AI, cs.CL, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan</p>

            <p><strong>Title:</strong><br>
            Towards Understanding Camera Motions in Any Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15376v1">http://arxiv.org/abs/2504.15376v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning</title>
      <itunes:episode>719</itunes:episode>
      <podcast:episode>719</podcast:episode>
      <itunes:title>Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7dcdfef1-0cc3-443c-bf03-74e808f3aadc</guid>
      <link>https://share.transistor.fm/s/915f0673</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16656v2">http://arxiv.org/abs/2504.16656v2</a></p>

            <p><strong>Abstract:</strong><br>
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations--a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16656v2">http://arxiv.org/abs/2504.16656v2</a></p>

            <p><strong>Abstract:</strong><br>
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations--a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 28 Apr 2025 20:13:55 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/915f0673/a4115bef.mp3" length="20684052" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16656v2">http://arxiv.org/abs/2504.16656v2</a></p>

            <p><strong>Abstract:</strong><br>
We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO), which harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations--a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs</title>
      <itunes:episode>718</itunes:episode>
      <podcast:episode>718</podcast:episode>
      <itunes:title>BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c605d8a-3f1d-40cd-8604-a4f39fb9c929</guid>
      <link>https://share.transistor.fm/s/f9d51d19</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.18415v1">http://arxiv.org/abs/2504.18415v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.18415v1">http://arxiv.org/abs/2504.18415v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 28 Apr 2025 20:13:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f9d51d19/783d19d9.mp3" length="19765389" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.18415v1">http://arxiv.org/abs/2504.18415v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step1X-Edit: A Practical Framework for General Image Editing</title>
      <itunes:episode>717</itunes:episode>
      <podcast:episode>717</podcast:episode>
      <itunes:title>Step1X-Edit: A Practical Framework for General Image Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">94d1fc93-77e1-49fc-8bf1-22249768cfd0</guid>
      <link>https://share.transistor.fm/s/46640aa4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step1X-Edit: A Practical Framework for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17761v1">http://arxiv.org/abs/2504.17761v1</a></p>

            <p><strong>Abstract:</strong><br>
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step1X-Edit: A Practical Framework for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17761v1">http://arxiv.org/abs/2504.17761v1</a></p>

            <p><strong>Abstract:</strong><br>
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Apr 2025 20:23:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46640aa4/49f4fd20.mp3" length="19960557" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step1X-Edit: A Practical Framework for General Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17761v1">http://arxiv.org/abs/2504.17761v1</a></p>

            <p><strong>Abstract:</strong><br>
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning</title>
      <itunes:episode>716</itunes:episode>
      <podcast:episode>716</podcast:episode>
      <itunes:title>Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89620251-b296-43cf-a5a6-a6dfe044027a</guid>
      <link>https://share.transistor.fm/s/7a8bc680</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17192v1">http://arxiv.org/abs/2504.17192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17192v1">http://arxiv.org/abs/2504.17192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Apr 2025 20:23:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7a8bc680/626ad586.mp3" length="21197737" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17192v1">http://arxiv.org/abs/2504.17192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</title>
      <itunes:episode>715</itunes:episode>
      <podcast:episode>715</podcast:episode>
      <itunes:title>RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">99abba45-fcb2-41dc-9139-25724f3cb12b</guid>
      <link>https://share.transistor.fm/s/9e1005ea</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor</p>

            <p><strong>Title:</strong><br>
            RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17502v1">http://arxiv.org/abs/2504.17502v1</a></p>

            <p><strong>Abstract:</strong><br>
Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., <em>Animal</em>, <em>Object</em>), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor</p>

            <p><strong>Title:</strong><br>
            RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17502v1">http://arxiv.org/abs/2504.17502v1</a></p>

            <p><strong>Abstract:</strong><br>
Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., <em>Animal</em>, <em>Object</em>), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Apr 2025 20:22:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9e1005ea/a9b2048f.mp3" length="19512942" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1216</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor</p>

            <p><strong>Title:</strong><br>
            RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17502v1">http://arxiv.org/abs/2504.17502v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., <em>Animal</em>, <em>Object</em>), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs</title>
      <itunes:episode>714</itunes:episode>
      <podcast:episode>714</podcast:episode>
      <itunes:title>Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2db114cb-566d-4436-b20c-027ddef9befc</guid>
      <link>https://share.transistor.fm/s/c117ca61</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng</p>

            <p><strong>Title:</strong><br>
            Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17432v1">http://arxiv.org/abs/2504.17432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng</p>

            <p><strong>Title:</strong><br>
            Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17432v1">http://arxiv.org/abs/2504.17432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 25 Apr 2025 20:22:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c117ca61/d60ec71b.mp3" length="24181547" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1508</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng</p>

            <p><strong>Title:</strong><br>
            Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.17432v1">http://arxiv.org/abs/2504.17432v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning</title>
      <itunes:episode>713</itunes:episode>
      <podcast:episode>713</podcast:episode>
      <itunes:title>DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d35fa4e5-2090-48ca-8234-cf1e02aba0f3</guid>
      <link>https://share.transistor.fm/s/8bfaeda1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14509v2">http://arxiv.org/abs/2504.14509v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14509v2">http://arxiv.org/abs/2504.14509v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Apr 2025 20:45:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8bfaeda1/8ab812ce.mp3" length="19943869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14509v2">http://arxiv.org/abs/2504.14509v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration, enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Trillion 7B Technical Report</title>
      <itunes:episode>712</itunes:episode>
      <podcast:episode>712</podcast:episode>
      <itunes:title>Trillion 7B Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c27bcd4c-4615-49a0-9617-b886c18f12cb</guid>
      <link>https://share.transistor.fm/s/d7805526</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin</p>

            <p><strong>Title:</strong><br>
            Trillion 7B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15431v1">http://arxiv.org/abs/2504.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin</p>

            <p><strong>Title:</strong><br>
            Trillion 7B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15431v1">http://arxiv.org/abs/2504.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Apr 2025 20:44:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d7805526/e3cd199c.mp3" length="24822644" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1548</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Sungjun Han, Juyoung Suk, Suyeong An, Hyungguk Kim, Kyuseok Kim, Wonsuk Yang, Seungtaek Choi, Jamin Shin</p>

            <p><strong>Title:</strong><br>
            Trillion 7B Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15431v1">http://arxiv.org/abs/2504.15431v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours ($148K) for full training. Comprehensive evaluations across 27 benchmarks in four languages demonstrate Trillion-7B's robust multilingual performance and exceptional cross-lingual consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Tina: Tiny Reasoning Models via LoRA</title>
      <itunes:episode>711</itunes:episode>
      <podcast:episode>711</podcast:episode>
      <itunes:title>Tina: Tiny Reasoning Models via LoRA</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">af5a74a8-5182-465f-9bf4-fd25e850307a</guid>
      <link>https://share.transistor.fm/s/5607ad18</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Tina: Tiny Reasoning Models via LoRA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15777v1">http://arxiv.org/abs/2504.15777v1</a></p>

            <p><strong>Abstract:</strong><br>
            How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a &gt;20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights &amp; checkpoints.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Tina: Tiny Reasoning Models via LoRA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15777v1">http://arxiv.org/abs/2504.15777v1</a></p>

            <p><strong>Abstract:</strong><br>
            How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a &gt;20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights &amp; checkpoints.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Apr 2025 20:44:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5607ad18/20982a0e.mp3" length="22287730" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1389</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Tina: Tiny Reasoning Models via LoRA</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15777v1">http://arxiv.org/abs/2504.15777v1</a></p>

            <p><strong>Abstract:</strong><br>
            How cost-effectively can strong reasoning abilities be achieved in language models? Driven by this fundamental question, we present Tina, a family of tiny reasoning models achieved with high cost-efficiency. Notably, Tina demonstrates that substantial reasoning performance can be developed using only minimal resources, by applying parameter-efficient updates during reinforcement learning (RL), using low-rank adaptation (LoRA), to an already tiny 1.5B parameter base model. This minimalist approach produces models that achieve reasoning performance which is competitive with, and sometimes surpasses, SOTA RL reasoning models built upon the same base model. Crucially, this is achieved at a tiny fraction of the computational post-training cost employed by existing SOTA models. In fact, the best Tina model achieves a &gt;20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness of efficient RL reasoning via LoRA. We validate this across multiple open-source reasoning datasets and various ablation settings starting with a single, fixed set of hyperparameters. Furthermore, we hypothesize that this effectiveness and efficiency stem from LoRA rapidly adapting the model to the structural format of reasoning rewarded by RL, while largely preserving the base model's underlying knowledge. In service of accessibility and open research, we fully open-source all code, training logs, and model weights &amp; checkpoints.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>I-Con: A Unifying Framework for Representation Learning</title>
      <itunes:episode>710</itunes:episode>
      <podcast:episode>710</podcast:episode>
      <itunes:title>I-Con: A Unifying Framework for Representation Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b53473c-8830-43d0-a9ca-b65915434f5d</guid>
      <link>https://share.transistor.fm/s/79dc6b0b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, cs.CV, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton</p>

            <p><strong>Title:</strong><br>
            I-Con: A Unifying Framework for Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16929v1">http://arxiv.org/abs/2504.16929v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, cs.CV, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton</p>

            <p><strong>Title:</strong><br>
            I-Con: A Unifying Framework for Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16929v1">http://arxiv.org/abs/2504.16929v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 24 Apr 2025 20:44:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/79dc6b0b/529fa17e.mp3" length="20214671" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.AI, cs.CV, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton</p>

            <p><strong>Title:</strong><br>
            I-Con: A Unifying Framework for Representation Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16929v1">http://arxiv.org/abs/2504.16929v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kuwain 1.5B: An Arabic SLM via Language Injection</title>
      <itunes:episode>709</itunes:episode>
      <podcast:episode>709</podcast:episode>
      <itunes:title>Kuwain 1.5B: An Arabic SLM via Language Injection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">68da66b5-2e01-4d03-859f-5696175beed3</guid>
      <link>https://share.transistor.fm/s/07e3c7dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Kuwain 1.5B: An Arabic SLM via Language Injection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15120v1">http://arxiv.org/abs/2504.15120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Kuwain 1.5B: An Arabic SLM via Language Injection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15120v1">http://arxiv.org/abs/2504.15120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Apr 2025 20:41:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07e3c7dc/65f4f761.mp3" length="19290975" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1202</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 94 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Khalil Hennara, Sara Chrouf, Mohamed Motaism Hamed, Zeina Aldallal, Omar Hadid, Safwan AlModhayan</p>

            <p><strong>Title:</strong><br>
            Kuwain 1.5B: An Arabic SLM via Language Injection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15120v1">http://arxiv.org/abs/2504.15120v1</a></p>

            <p><strong>Abstract:</strong><br>
            Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TTRL: Test-Time Reinforcement Learning</title>
      <itunes:episode>708</itunes:episode>
      <podcast:episode>708</podcast:episode>
      <itunes:title>TTRL: Test-Time Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">778bd2c0-db24-4ff1-8e1d-b1f95802b721</guid>
      <link>https://share.transistor.fm/s/444dbb6c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            TTRL: Test-Time Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16084v1">http://arxiv.org/abs/2504.16084v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance that consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            TTRL: Test-Time Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16084v1">http://arxiv.org/abs/2504.16084v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance that consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Apr 2025 20:40:47 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/444dbb6c/ffa95d77.mp3" length="24169384" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1507</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 60 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            TTRL: Test-Time Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16084v1">http://arxiv.org/abs/2504.16084v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance that consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</title>
      <itunes:episode>707</itunes:episode>
      <podcast:episode>707</podcast:episode>
      <itunes:title>The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">61e9fa55-46a1-4220-b53b-2a082a8fdfb3</guid>
      <link>https://share.transistor.fm/s/f7d058c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15521v1">http://arxiv.org/abs/2504.15521v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose corresponding guiding principles for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15521v1">http://arxiv.org/abs/2504.15521v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose corresponding guiding principles for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Apr 2025 20:40:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f7d058c6/7ad871d7.mp3" length="21424251" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15521v1">http://arxiv.org/abs/2504.15521v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal that, despite significant investments amounting to tens of millions of dollars, English remains significantly overrepresented in these benchmarks. Additionally, most benchmarks rely on original language content rather than translations, with the majority sourced from high-resource countries such as China, India, Germany, the UK, and the USA. Furthermore, a comparison of benchmark performance with human judgments highlights notable disparities. STEM-related tasks exhibit strong correlations with human evaluations (0.70 to 0.85), while traditional NLP tasks like question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30). Moreover, translating English benchmarks into other languages proves insufficient, as localized benchmarks demonstrate significantly higher alignment with local human judgments (0.68) than their translated counterparts (0.47). This underscores the importance of creating culturally and linguistically tailored benchmarks rather than relying solely on translations. Through this comprehensive analysis, we highlight six key limitations in current multilingual evaluation practices, propose corresponding guiding principles for effective multilingual benchmarking, and outline five critical research directions to drive progress in the field. Finally, we call for a global collaborative effort to develop human-aligned benchmarks that prioritize real-world applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Describe Anything: Detailed Localized Image and Video Captioning</title>
      <itunes:episode>706</itunes:episode>
      <podcast:episode>706</podcast:episode>
      <itunes:title>Describe Anything: Detailed Localized Image and Video Captioning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27b2d8ab-48e5-4a17-922c-8d4d4ce94182</guid>
      <link>https://share.transistor.fm/s/014284d1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui</p>

            <p><strong>Title:</strong><br>
            Describe Anything: Detailed Localized Image and Video Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16072v1">http://arxiv.org/abs/2504.16072v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui</p>

            <p><strong>Title:</strong><br>
            Describe Anything: Detailed Localized Image and Video Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16072v1">http://arxiv.org/abs/2504.16072v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Apr 2025 20:40:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/014284d1/4ea6dde6.mp3" length="23288770" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui</p>

            <p><strong>Title:</strong><br>
            Describe Anything: Detailed Localized Image and Video Captioning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.16072v1">http://arxiv.org/abs/2504.16072v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning Adaptive Parallel Reasoning with Language Models</title>
      <itunes:episode>705</itunes:episode>
      <podcast:episode>705</podcast:episode>
      <itunes:title>Learning Adaptive Parallel Reasoning with Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ec30a74-b414-4e10-b3a2-3e5dcd648afe</guid>
      <link>https://share.transistor.fm/s/da84e5cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr</p>

            <p><strong>Title:</strong><br>
            Learning Adaptive Parallel Reasoning with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15466v1">http://arxiv.org/abs/2504.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.</p>
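
            <p><strong>Illustrative sketch (not from the paper):</strong> a toy Python loop showing the spawn()/join() orchestration pattern the abstract describes, where a parent thread launches child inference calls in parallel and then conditions its final answer on their joined results. The <code>generate</code> helper and the subgoal split are hypothetical stand-ins for a learned language-model policy.</p>

            <pre><code>from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    # Hypothetical placeholder for a language-model decoding call.
    return f"[answer to: {prompt}]"

def solve(prompt: str, depth: int = 0, max_depth: int = 1) -> str:
    # The parent thread either falls back to serialized reasoning or fans out
    # into child threads (in APR this spawn/serialize decision is learned).
    if depth >= max_depth:
        return generate(prompt)
    subgoals = [f"{prompt} / subgoal {i}" for i in range(3)]
    with ThreadPoolExecutor() as pool:                    # spawn(): launch children
        futures = [pool.submit(solve, s, depth + 1, max_depth) for s in subgoals]
        results = [f.result() for f in futures]           # join(): collect children
    # The parent conditions its final answer on the joined child summaries.
    return generate(prompt + " | " + " ; ".join(results))

print(solve("reach 24 using 3, 3, 8, 8"))
</code></pre>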
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr</p>

            <p><strong>Title:</strong><br>
            Learning Adaptive Parallel Reasoning with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15466v1">http://arxiv.org/abs/2504.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 23 Apr 2025 20:39:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/da84e5cb/23dc0509.mp3" length="20244348" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1262</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr</p>

            <p><strong>Title:</strong><br>
            Learning Adaptive Parallel Reasoning with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15466v1">http://arxiv.org/abs/2504.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning to Reason under Off-Policy Guidance</title>
      <itunes:episode>704</itunes:episode>
      <podcast:episode>704</podcast:episode>
      <itunes:title>Learning to Reason under Off-Policy Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4d0fe91e-2a16-4b0f-8c5d-2543296cf39f</guid>
      <link>https://share.transistor.fm/s/7d1254f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang</p>

            <p><strong>Title:</strong><br>
            Learning to Reason under Off-Policy Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14945v2">http://arxiv.org/abs/2504.14945v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang</p>

            <p><strong>Title:</strong><br>
            Learning to Reason under Off-Policy Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14945v2">http://arxiv.org/abs/2504.14945v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:56:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d1254f1/39b25193.mp3" length="21263738" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang</p>

            <p><strong>Title:</strong><br>
            Learning to Reason under Off-Policy Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.14945v2">http://arxiv.org/abs/2504.14945v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</title>
      <itunes:episode>703</itunes:episode>
      <podcast:episode>703</podcast:episode>
      <itunes:title>Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">47880802-fc3a-4d8a-add3-4c4262f35512</guid>
      <link>https://share.transistor.fm/s/29b1ecee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tuomas Rintamaki, Tyler Poon, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu</p>

            <p><strong>Title:</strong><br>
            Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15271v1">http://arxiv.org/abs/2504.15271v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tuomas Rintamaki, Tyler Poon, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu</p>

            <p><strong>Title:</strong><br>
            Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15271v1">http://arxiv.org/abs/2504.15271v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:55:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/29b1ecee/a44cf13a.mp3" length="19668007" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1226</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tuomas Rintamaki, Tyler Poon, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu</p>

            <p><strong>Title:</strong><br>
            Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15271v1">http://arxiv.org/abs/2504.15271v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution image understanding, introducing a generalist framework for both tasks. The proposed training framework incorporates Automatic Degrade Sampling and Image Area Preservation, two techniques that preserve contextual integrity and visual details. The framework also includes numerous efficiency optimizations in the pipeline for long-context data training. Finally, we propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations, facilitating long-video understanding. Eagle 2.5 demonstrates substantial improvements on long-context multimodal benchmarks, providing a robust solution to the limitations of existing VLMs. Notably, our best model Eagle 2.5-8B achieves 72.4% on Video-MME with 512 input frames, matching the results of top-tier commercial models such as GPT-4o and large-scale open-source models like Qwen2.5-VL-72B and InternVL2.5-78B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlowReasoner: Reinforcing Query-Level Meta-Agents</title>
      <itunes:episode>702</itunes:episode>
      <podcast:episode>702</podcast:episode>
      <itunes:title>FlowReasoner: Reinforcing Query-Level Meta-Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">62c1e96e-afc5-4162-ab06-83ed804cb5a0</guid>
      <link>https://share.transistor.fm/s/9ecce80d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            FlowReasoner: Reinforcing Query-Level Meta-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15257v1">http://arxiv.org/abs/2504.15257v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic reasoning ability to generate multi-agent systems. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% in accuracy across three benchmarks. The code is available at https://github.com/sail-sg/FlowReasoner.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            FlowReasoner: Reinforcing Query-Level Meta-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15257v1">http://arxiv.org/abs/2504.15257v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic reasoning ability to generate multi-agent systems. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% in accuracy across three benchmarks. The code is available at https://github.com/sail-sg/FlowReasoner.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:55:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ecce80d/16c2220a.mp3" length="17556863" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1094</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongcheng Gao, Yue Liu, Yufei He, Longxu Dou, Chao Du, Zhijie Deng, Bryan Hooi, Min Lin, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            FlowReasoner: Reinforcing Query-Level Meta-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15257v1">http://arxiv.org/abs/2504.15257v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper proposes a query-level meta-agent named FlowReasoner to automate the design of query-level multi-agent systems, i.e., one system per user query. Our core idea is to incentivize a reasoning-based meta-agent via external execution feedback. Concretely, by distilling DeepSeek R1, we first endow FlowReasoner with the basic reasoning ability to generate multi-agent systems. Then, we further enhance it via reinforcement learning (RL) with external execution feedback. A multi-purpose reward is designed to guide the RL training from aspects of performance, complexity, and efficiency. In this manner, FlowReasoner is enabled to generate a personalized multi-agent system for each user query via deliberative reasoning. Experiments on both engineering and competition code benchmarks demonstrate the superiority of FlowReasoner. Remarkably, it surpasses o1-mini by 10.52% in accuracy across three benchmarks. The code is available at https://github.com/sail-sg/FlowReasoner.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ToolRL: Reward is All Tool Learning Needs</title>
      <itunes:episode>701</itunes:episode>
      <podcast:episode>701</podcast:episode>
      <itunes:title>ToolRL: Reward is All Tool Learning Needs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8bbf0557-dd86-4501-b95e-89dcff9ee22f</guid>
      <link>https://share.transistor.fm/s/7d5384e2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji</p>

            <p><strong>Title:</strong><br>
            ToolRL: Reward is All Tool Learning Needs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13958v1">http://arxiv.org/abs/2504.13958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.</p>
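
            <p><strong>Illustrative sketch (not from the paper):</strong> the group-relative advantage computation at the core of GRPO, into which a reward design like the one studied here feeds; rewards for a group of rollouts sampled from the same prompt are normalized against the group's mean and standard deviation. The example reward values are hypothetical.</p>

            <pre><code>import statistics

def grpo_advantages(rewards, eps=1e-6):
    # Normalize each rollout's reward against its group's statistics.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 rollouts, e.g. format reward + correctness reward summed.
group_rewards = [1.0, 0.5, 0.5, 0.0]
print([round(a, 2) for a in grpo_advantages(group_rewards)])  # [1.41, 0.0, 0.0, -1.41]
</code></pre>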
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji</p>

            <p><strong>Title:</strong><br>
            ToolRL: Reward is All Tool Learning Needs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13958v1">http://arxiv.org/abs/2504.13958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:54:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d5384e2/f764da58.mp3" length="23030030" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji</p>

            <p><strong>Title:</strong><br>
            ToolRL: Reward is All Tool Learning Needs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13958v1">http://arxiv.org/abs/2504.13958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents</title>
      <itunes:episode>700</itunes:episode>
      <podcast:episode>700</podcast:episode>
      <itunes:title>X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7881ca92-f831-4b63-88eb-0f2c74fdd2d8</guid>
      <link>https://share.transistor.fm/s/600a9634</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel</p>

            <p><strong>Title:</strong><br>
            X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13203v1">http://arxiv.org/abs/2504.13203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel</p>

            <p><strong>Title:</strong><br>
            X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13203v1">http://arxiv.org/abs/2504.13203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:54:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/600a9634/c624e358.mp3" length="20183341" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1258</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel</p>

            <p><strong>Title:</strong><br>
            X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13203v1">http://arxiv.org/abs/2504.13203v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians</title>
      <itunes:episode>699</itunes:episode>
      <podcast:episode>699</podcast:episode>
      <itunes:title>StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">553e7d79-4818-41d5-933c-5341594f6676</guid>
      <link>https://share.transistor.fm/s/faddea1e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li</p>

            <p><strong>Title:</strong><br>
            StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15281v1">http://arxiv.org/abs/2504.15281v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li</p>

            <p><strong>Title:</strong><br>
            StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15281v1">http://arxiv.org/abs/2504.15281v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 22 Apr 2025 20:54:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/faddea1e/9dfe744c.mp3" length="22411911" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li</p>

            <p><strong>Title:</strong><br>
            StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.15281v1">http://arxiv.org/abs/2504.15281v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on NeRF synthetic dataset (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?</title>
      <itunes:episode>698</itunes:episode>
      <podcast:episode>698</podcast:episode>
      <itunes:title>Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c8d8a96-4526-46a6-9e16-5a970d4ea1d2</guid>
      <link>https://share.transistor.fm/s/af49fcaa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13837v1">http://arxiv.org/abs/2504.13837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score compared to their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts the performance by biasing the model's output distribution toward paths that are more likely to yield rewards, therefore sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, different from RLVR. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io</p>
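
            <p><strong>Illustrative sketch (not from the paper):</strong> the standard unbiased pass@k estimator from the code-generation evaluation literature, which analyses of this kind typically rely on; n samples are drawn per problem and c of them are correct. The example numbers are hypothetical.</p>

            <pre><code>from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn (without replacement)
    # from n generations is correct, given that c of the n are correct.
    if k > n - c:
        return 1.0  # fewer than k incorrect samples, so every k-subset hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 12 of them correct.
print(round(pass_at_k(200, 12, 1), 3))    # 0.06  (pass@1)
print(round(pass_at_k(200, 12, 128), 3))  # ~1.0  (pass@128)
</code></pre>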
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13837v1">http://arxiv.org/abs/2504.13837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score than their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, requiring us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 21 Apr 2025 20:11:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/af49fcaa/a9452609.mp3" length="20984593" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang</p>

            <p><strong>Title:</strong><br>
            Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13837v1">http://arxiv.org/abs/2504.13837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models can achieve a comparable or even higher pass@k score than their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, requiring us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</title>
      <itunes:episode>697</itunes:episode>
      <podcast:episode>697</podcast:episode>
      <itunes:title>MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">851fafb0-5939-4a59-a76f-e08206d4f08e</guid>
      <link>https://share.transistor.fm/s/9d38bde4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen</p>

            <p><strong>Title:</strong><br>
            MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13835v1">http://arxiv.org/abs/2504.13835v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen</p>

            <p><strong>Title:</strong><br>
            MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13835v1">http://arxiv.org/abs/2504.13835v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 21 Apr 2025 20:11:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9d38bde4/8a907d76.mp3" length="18657819" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1162</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen</p>

            <p><strong>Title:</strong><br>
            MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13835v1">http://arxiv.org/abs/2504.13835v1</a></p>

            <p><strong>Abstract:</strong><br>
            Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes</title>
      <itunes:episode>696</itunes:episode>
      <podcast:episode>696</podcast:episode>
      <itunes:title>NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d91c013b-e34a-4ca1-a942-cc5e604e6705</guid>
      <link>https://share.transistor.fm/s/92f46552</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11544v1">http://arxiv.org/abs/2504.11544v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11544v1">http://arxiv.org/abs/2504.11544v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 21 Apr 2025 20:10:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92f46552/47c12b28.mp3" length="19924613" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1242</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tianyang Xu, Haojie Zheng, Chengze Li, Haoxiang Chen, Yixin Liu, Ruoxi Chen, Lichao Sun</p>

            <p><strong>Title:</strong><br>
            NodeRAG: Structuring Graph-based RAG with Heterogeneous Nodes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11544v1">http://arxiv.org/abs/2504.11544v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. Inadequately designed graphs not only impede the seamless integration of diverse graph algorithms but also result in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at https://github.com/Terry-Xu-666/NodeRAG.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training</title>
      <itunes:episode>695</itunes:episode>
      <podcast:episode>695</podcast:episode>
      <itunes:title>CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">91960019-f3d9-46e5-b8b1-e4a32d443a15</guid>
      <link>https://share.transistor.fm/s/d7a1d7ac</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13161v1">http://arxiv.org/abs/2504.13161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13161v1">http://arxiv.org/abs/2504.13161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:50:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d7a1d7ac/8ff4cf55.mp3" length="21835972" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1361</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 69 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13161v1">http://arxiv.org/abs/2504.13161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Antidistillation Sampling</title>
      <itunes:episode>694</itunes:episode>
      <podcast:episode>694</podcast:episode>
      <itunes:title>Antidistillation Sampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1cd20425-2e25-4f3b-a733-d93eec735615</guid>
      <link>https://share.transistor.fm/s/52cfdef9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Antidistillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13146v1">http://arxiv.org/abs/2504.13146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Antidistillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13146v1">http://arxiv.org/abs/2504.13146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:50:33 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52cfdef9/88b299ea.mp3" length="17665090" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1100</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Antidistillation Sampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13146v1">http://arxiv.org/abs/2504.13146v1</a></p>

            <p><strong>Abstract:</strong><br>
            Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling</title>
      <itunes:episode>693</itunes:episode>
      <podcast:episode>693</podcast:episode>
      <itunes:title>Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">af5f5a6b-175a-44bd-988b-7774de5c37df</guid>
      <link>https://share.transistor.fm/s/b8a1e521</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan</p>

            <p><strong>Title:</strong><br>
            Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13169v1">http://arxiv.org/abs/2504.13169v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. Though effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise them. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan</p>

            <p><strong>Title:</strong><br>
            Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13169v1">http://arxiv.org/abs/2504.13169v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. Though effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise them. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:50:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8a1e521/2b2904e9.mp3" length="19551833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1218</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan</p>

            <p><strong>Title:</strong><br>
            Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.13169v1">http://arxiv.org/abs/2504.13169v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. Though effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise them. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: https://reverse-vlm.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Packing Input Frame Context in Next-Frame Prediction Models for Video Generation</title>
      <itunes:episode>692</itunes:episode>
      <podcast:episode>692</podcast:episode>
      <itunes:title>Packing Input Frame Context in Next-Frame Prediction Models for Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bd6a54fa-c379-4e8e-99ff-d282b4d19955</guid>
      <link>https://share.transistor.fm/s/83321624</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lvmin Zhang, Maneesh Agrawala</p>

            <p><strong>Title:</strong><br>
            Packing Input Frame Context in Next-Frame Prediction Models for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12626v1">http://arxiv.org/abs/2504.12626v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to that of image diffusion. This also allows significantly higher training video batch sizes (batch sizes become comparable to those of image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lvmin Zhang, Maneesh Agrawala</p>

            <p><strong>Title:</strong><br>
            Packing Input Frame Context in Next-Frame Prediction Models for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12626v1">http://arxiv.org/abs/2504.12626v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to that of image diffusion. This also allows significantly higher training video batch sizes (batch sizes become comparable to those of image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:49:51 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83321624/45cb0ed8.mp3" length="23290040" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Lvmin Zhang, Maneesh Agrawala</p>

            <p><strong>Title:</strong><br>
            Packing Input Frame Context in Next-Frame Prediction Models for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12626v1">http://arxiv.org/abs/2504.12626v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to that of image diffusion. This also allows significantly higher training video batch sizes (batch sizes become comparable to those of image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WORLDMEM: Long-term Consistent World Simulation with Memory</title>
      <itunes:episode>691</itunes:episode>
      <podcast:episode>691</podcast:episode>
      <itunes:title>WORLDMEM: Long-term Consistent World Simulation with Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">800f22b7-8cbc-4d85-b593-490c28665240</guid>
      <link>https://share.transistor.fm/s/ed683733</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            WORLDMEM: Long-term Consistent World Simulation with Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12369v1">http://arxiv.org/abs/2504.12369v1</a></p>

            <p><strong>Abstract:</strong><br>
            World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            WORLDMEM: Long-term Consistent World Simulation with Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12369v1">http://arxiv.org/abs/2504.12369v1</a></p>

            <p><strong>Abstract:</strong><br>
            World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:49:30 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ed683733/c56b907b.mp3" length="21608987" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            WORLDMEM: Long-term Consistent World Simulation with Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12369v1">http://arxiv.org/abs/2504.12369v1</a></p>

            <p><strong>Abstract:</strong><br>
            World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis</title>
      <itunes:episode>690</itunes:episode>
      <podcast:episode>690</podcast:episode>
      <itunes:title>A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cd22ec33-265f-474d-87d6-8420a8c36a05</guid>
      <link>https://share.transistor.fm/s/8b772bd5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12322v1">http://arxiv.org/abs/2504.12322v1</a></p>

            <p><strong>Abstract:</strong><br>
            While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12322v1">http://arxiv.org/abs/2504.12322v1</a></p>

            <p><strong>Abstract:</strong><br>
            While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 18 Apr 2025 20:49:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8b772bd5/ee4e1345.mp3" length="22709081" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1416</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Conghui He, Lijun Wu</p>

            <p><strong>Title:</strong><br>
            A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12322v1">http://arxiv.org/abs/2504.12322v1</a></p>

            <p><strong>Abstract:</strong><br>
            While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness</title>
      <itunes:episode>689</itunes:episode>
      <podcast:episode>689</podcast:episode>
      <itunes:title>ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f01870c2-6bd9-428f-8279-891e9ea90a78</guid>
      <link>https://share.transistor.fm/s/76b64119</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10514v1">http://arxiv.org/abs/2504.10514v1</a></p>

            <p><strong>Abstract:</strong><br>
            Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10514v1">http://arxiv.org/abs/2504.10514v1</a></p>

            <p><strong>Abstract:</strong><br>
            Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 17 Apr 2025 20:11:40 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/76b64119/3a6d179e.mp3" length="21472806" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10514v1">http://arxiv.org/abs/2504.10514v1</a></p>

            <p><strong>Abstract:</strong><br>
            Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracy and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BitNet b1.58 2B4T Technical Report</title>
      <itunes:episode>688</itunes:episode>
      <podcast:episode>688</podcast:episode>
      <itunes:title>BitNet b1.58 2B4T Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">08571216-2dbc-4775-a31d-0fb8bdb94cf4</guid>
      <link>https://share.transistor.fm/s/cc37e4e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet b1.58 2B4T Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12285v1">http://arxiv.org/abs/2504.12285v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet b1.58 2B4T Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12285v1">http://arxiv.org/abs/2504.12285v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 17 Apr 2025 20:11:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cc37e4e6/98d7dd6c.mp3" length="18979999" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1183</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet b1.58 2B4T Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.12285v1">http://arxiv.org/abs/2504.12285v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens, the model has been rigorously evaluated across benchmarks covering language understanding, mathematical reasoning, coding proficiency, and conversational ability. Our results demonstrate that BitNet b1.58 2B4T achieves performance on par with leading open-weight, full-precision LLMs of similar size, while offering significant advantages in computational efficiency, including substantially reduced memory footprint, energy consumption, and decoding latency. To facilitate further research and adoption, the model weights are released via Hugging Face along with open-source inference implementations for both GPU and CPU architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</title>
      <itunes:episode>687</itunes:episode>
      <podcast:episode>687</podcast:episode>
      <itunes:title>ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6600092d-befc-4ece-b8c9-86487beba540</guid>
      <link>https://share.transistor.fm/s/a0f8f025</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong</p>

            <p><strong>Title:</strong><br>
            ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11536v2">http://arxiv.org/abs/2504.11536v2</a></p>

            <p><strong>Abstract:</strong><br>
            While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong</p>

            <p><strong>Title:</strong><br>
            ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11536v2">http://arxiv.org/abs/2504.11536v2</a></p>

            <p><strong>Abstract:</strong><br>
            While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 17 Apr 2025 20:10:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a0f8f025/dabd37ab.mp3" length="22436548" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong</p>

            <p><strong>Title:</strong><br>
            ReTool: Reinforcement Learning for Strategic Tool Use in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11536v2">http://arxiv.org/abs/2504.11536v2</a></p>

            <p><strong>Abstract:</strong><br>
            While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>xVerify: Efficient Answer Verifier for Reasoning Model Evaluations</title>
      <itunes:episode>686</itunes:episode>
      <podcast:episode>686</podcast:episode>
      <itunes:title>xVerify: Efficient Answer Verifier for Reasoning Model Evaluations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">792572fa-1899-46b5-9164-b24012785753</guid>
      <link>https://share.transistor.fm/s/50aeacc3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li</p>

            <p><strong>Title:</strong><br>
            xVerify: Efficient Answer Verifier for Reasoning Model Evaluations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10481v1">http://arxiv.org/abs/2504.10481v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li</p>

            <p><strong>Title:</strong><br>
            xVerify: Efficient Answer Verifier for Reasoning Model Evaluations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10481v1">http://arxiv.org/abs/2504.10481v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 21:00:24 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/50aeacc3/ebf583dc.mp3" length="21122071" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 63 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li</p>

            <p><strong>Title:</strong><br>
            xVerify: Efficient Answer Verifier for Reasoning Model Evaluations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10481v1">http://arxiv.org/abs/2504.10481v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning</title>
      <itunes:episode>685</itunes:episode>
      <podcast:episode>685</podcast:episode>
      <itunes:title>Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc0441b0-0889-4b5d-9fc9-3b8613fb3583</guid>
      <link>https://share.transistor.fm/s/bcaa9f48</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08672v1">http://arxiv.org/abs/2504.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face problems of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and uses it to optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08672v1">http://arxiv.org/abs/2504.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face problems of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and uses it to optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 21:00:01 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bcaa9f48/8bce07a1.mp3" length="19483699" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1214</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08672v1">http://arxiv.org/abs/2504.08672v1</a></p>

            <p><strong>Abstract:</strong><br>
            Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face problems of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and uses it to optimize the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients</title>
      <itunes:episode>684</itunes:episode>
      <podcast:episode>684</podcast:episode>
      <itunes:title>How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7d9b864-60e9-455d-8a20-890821e1718c</guid>
      <link>https://share.transistor.fm/s/f6cd0e32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10766v1">http://arxiv.org/abs/2504.10766v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view of the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, offering novel insights into developing better data exploration strategies for post-training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10766v1">http://arxiv.org/abs/2504.10766v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view of the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, offering novel insights into developing better data exploration strategies for post-training.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 20:59:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f6cd0e32/fccdf6a3.mp3" length="21318137" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10766v1">http://arxiv.org/abs/2504.10766v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view of the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, offering novel insights into developing better data exploration strategies for post-training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Heimdall: test-time scaling on the generative verification</title>
      <itunes:episode>683</itunes:episode>
      <podcast:episode>683</podcast:episode>
      <itunes:title>Heimdall: test-time scaling on the generative verification</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">743d9c33-cacf-46b4-9f4c-46f7cef4e435</guid>
      <link>https://share.transistor.fm/s/a7e9a23d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Wenlei Shi, Xing Jin</p>

            <p><strong>Title:</strong><br>
            Heimdall: test-time scaling on the generative verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10337v2">http://arxiv.org/abs/2504.10337v2</a></p>

            <p><strong>Abstract:</strong><br>
            An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated the great potential of LLMs on solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, a problem type not included during training. Furthermore, we propose Pessimistic Verification to extend the functionality of Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system where one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Wenlei Shi, Xing Jin</p>

            <p><strong>Title:</strong><br>
            Heimdall: test-time scaling on the generative verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10337v2">http://arxiv.org/abs/2504.10337v2</a></p>

            <p><strong>Abstract:</strong><br>
            An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated the great potential of LLMs on solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, a problem type not included during training. Furthermore, we propose Pessimistic Verification to extend the functionality of Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system where one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 20:59:12 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a7e9a23d/85924dcd.mp3" length="19250442" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1199</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.AI, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Wenlei Shi, Xing Jin</p>

            <p><strong>Title:</strong><br>
            Heimdall: test-time scaling on the generative verification</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10337v2">http://arxiv.org/abs/2504.10337v2</a></p>

            <p><strong>Abstract:</strong><br>
            An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated the great potential of LLMs on solving competitive problems, but their verification ability remains weak and insufficiently investigated. In this paper, we propose Heimdall, a long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, a problem type not included during training. Furthermore, we propose Pessimistic Verification to extend the functionality of Heimdall to scaling up problem solving. It calls Heimdall to judge the solutions from a solver model and, based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with a 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system where one component poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with recent ablation studies from NuminaMath.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding</title>
      <itunes:episode>682</itunes:episode>
      <podcast:episode>682</podcast:episode>
      <itunes:title>Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7f880e65-daf7-418b-8c81-2152741951c8</guid>
      <link>https://share.transistor.fm/s/39083316</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10465v1">http://arxiv.org/abs/2504.10465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, these works rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens in a single transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using manual checks. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10465v1">http://arxiv.org/abs/2504.10465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, these works rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens in a single transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using manual checks. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 20:58:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39083316/e6b12e1e.mp3" length="20028269" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10465v1">http://arxiv.org/abs/2504.10465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, existing works rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent works on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learn vision tokens and text tokens in a single transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench) using manual checks. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at https://github.com/magic-research/Sa2VA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TextArena</title>
      <itunes:episode>681</itunes:episode>
      <podcast:episode>681</podcast:episode>
      <itunes:title>TextArena</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12dae877-661f-4a6b-9a43-3a76c73c19ab</guid>
      <link>https://share.transistor.fm/s/51020a8b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan</p>

            <p><strong>Title:</strong><br>
            TextArena</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11442v1">http://arxiv.org/abs/2504.11442v1</a></p>

            <p><strong>Abstract:</strong><br>
            TextArena is an open-source collection of competitive text-based games for training and evaluating agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, the leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan</p>

            <p><strong>Title:</strong><br>
            TextArena</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11442v1">http://arxiv.org/abs/2504.11442v1</a></p>

            <p><strong>Abstract:</strong><br>
            TextArena is an open-source collection of competitive text-based games for training and evaluating agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, the leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 16 Apr 2025 20:58:15 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/51020a8b/00c32910.mp3" length="21550005" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan</p>

            <p><strong>Title:</strong><br>
            TextArena</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.11442v1">http://arxiv.org/abs/2504.11442v1</a></p>

            <p><strong>Abstract:</strong><br>
            TextArena is an open-source collection of competitive text-based games for training and evaluating agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community, and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, the leaderboard, and examples is available at https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models</title>
      <itunes:episode>680</itunes:episode>
      <podcast:episode>680</podcast:episode>
      <itunes:title>InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">30a85bb7-e537-4f54-8ed6-75b1265d6636</guid>
      <link>https://share.transistor.fm/s/f4996bd9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 172 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10479v2">http://arxiv.org/abs/2504.10479v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 172 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10479v2">http://arxiv.org/abs/2504.10479v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Apr 2025 20:45:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f4996bd9/8c38e1ca.mp3" length="21949241" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 172 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang</p>

            <p><strong>Title:</strong><br>
            InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.10479v2">http://arxiv.org/abs/2504.10479v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters</title>
      <itunes:episode>679</itunes:episode>
      <podcast:episode>679</podcast:episode>
      <itunes:title>PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c9b81d3b-7eff-4848-a1a1-1e19a543b5a1</guid>
      <link>https://share.transistor.fm/s/0edff06c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.DC, cs.AI, 68T50, I.2.7; I.2.11</p>

            <p><strong>Authors:</strong><br>
            Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu</p>

            <p><strong>Title:</strong><br>
            PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08791v1">http://arxiv.org/abs/2504.08791v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.DC, cs.AI, 68T50, I.2.7; I.2.11</p>

            <p><strong>Authors:</strong><br>
            Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu</p>

            <p><strong>Title:</strong><br>
            PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08791v1">http://arxiv.org/abs/2504.08791v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Apr 2025 20:45:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0edff06c/ff7de496.mp3" length="23026312" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1435</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 95 | cs.DC, cs.AI, 68T50, I.2.7; I.2.11</p>

            <p><strong>Authors:</strong><br>
            Zonghang Li, Tao Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu</p>

            <p><strong>Title:</strong><br>
            PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08791v1">http://arxiv.org/abs/2504.08791v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence of DeepSeek R1 and QwQ 32B has broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces prima.cpp, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device's CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate prima.cpp on a common four-node home cluster. It outperforms llama.cpp, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ, to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at https://github.com/Lizonghang/prima.cpp.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning</title>
      <itunes:episode>678</itunes:episode>
      <podcast:episode>678</podcast:episode>
      <itunes:title>VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f71358bc-e2ab-47b4-bfe7-8b5c1d8dd638</guid>
      <link>https://share.transistor.fm/s/79a90081</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08837v1">http://arxiv.org/abs/2504.08837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08837v1">http://arxiv.org/abs/2504.08837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Apr 2025 20:44:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/79a90081/ae74d93e.mp3" length="19954742" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08837v1">http://arxiv.org/abs/2504.08837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding</title>
      <itunes:episode>677</itunes:episode>
      <podcast:episode>677</podcast:episode>
      <itunes:title>FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0cd864f6-ced4-4fcd-b5f4-3f211d406c54</guid>
      <link>https://share.transistor.fm/s/663cd3f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09925v1">http://arxiv.org/abs/2504.09925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09925v1">http://arxiv.org/abs/2504.09925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Apr 2025 20:44:36 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/663cd3f1/281cd7ef.mp3" length="19667184" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1226</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang</p>

            <p><strong>Title:</strong><br>
            FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09925v1">http://arxiv.org/abs/2504.09925v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales, 3B and 8B, and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under the same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FUSION.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Iterative Self-Training for Code Generation via Reinforced Re-Ranking</title>
      <itunes:episode>676</itunes:episode>
      <podcast:episode>676</podcast:episode>
      <itunes:title>Iterative Self-Training for Code Generation via Reinforced Re-Ranking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6c50fa4c-5b4d-4783-8c25-8c0f5e7b1f4d</guid>
      <link>https://share.transistor.fm/s/8e4a6d33</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.IR, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nikita Sorokin, Ivan Sedykh, Valentin Malykh</p>

            <p><strong>Title:</strong><br>
            Iterative Self-Training for Code Generation via Reinforced Re-Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09643v1">http://arxiv.org/abs/2504.09643v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.IR, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nikita Sorokin, Ivan Sedykh, Valentin Malykh</p>

            <p><strong>Title:</strong><br>
            Iterative Self-Training for Code Generation via Reinforced Re-Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09643v1">http://arxiv.org/abs/2504.09643v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 15 Apr 2025 20:44:13 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e4a6d33/c3621cb0.mp3" length="19384200" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1208</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.IR, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nikita Sorokin, Ivan Sedykh, Valentin Malykh</p>

            <p><strong>Title:</strong><br>
            Iterative Self-Training for Code Generation via Reinforced Re-Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.09643v1">http://arxiv.org/abs/2504.09643v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model</title>
      <itunes:episode>675</itunes:episode>
      <podcast:episode>675</podcast:episode>
      <itunes:title>Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">13d06602-449a-4880-9687-a5f44604fb16</guid>
      <link>https://share.transistor.fm/s/f47ba886</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08685v1">http://arxiv.org/abs/2504.08685v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B), called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08685v1">http://arxiv.org/abs/2504.08685v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B), called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Apr 2025 20:14:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f47ba886/e68d6a11.mp3" length="21945875" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yanghua Peng, Yuping Wang, Yuxuan Wang, Zhongkai Zhao, Zhuo Chen, Zuquan Song, Zhenheng Yang, Jiashi Feng, Jianchao Yang, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08685v1">http://arxiv.org/abs/2504.08685v1</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report presents a cost-efficient strategy for training a video generation foundation model. We present a mid-sized research model with approximately 7 billion parameters (7B), called Seaweed-7B, trained from scratch using 665,000 H100 GPU hours. Despite being trained with moderate computational resources, Seaweed-7B demonstrates highly competitive performance compared to contemporary video generation models of much larger size. Design choices are especially crucial in a resource-constrained setting. This technical report highlights the key design decisions that enhance the performance of the medium-sized diffusion model. Empirically, we make two observations: (1) Seaweed-7B achieves performance comparable to, or even surpassing, larger models trained on substantially greater GPU resources, and (2) our model, which exhibits strong generalization ability, can be effectively adapted across a wide range of downstream applications either by lightweight fine-tuning or continued training. See the project page at https://seaweed.video/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation</title>
      <itunes:episode>674</itunes:episode>
      <podcast:episode>674</podcast:episode>
      <itunes:title>GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c025bdf9-970a-493c-ba3b-29c04ced73c8</guid>
      <link>https://share.transistor.fm/s/d22a80a6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08736v1">http://arxiv.org/abs/2504.08736v1</a></p>

            <p><strong>Abstract:</strong><br>
            In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08736v1">http://arxiv.org/abs/2504.08736v1</a></p>

            <p><strong>Abstract:</strong><br>
            In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Apr 2025 20:14:10 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d22a80a6/b75417b5.mp3" length="17893365" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1115</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08736v1">http://arxiv.org/abs/2504.08736v1</a></p>

            <p><strong>Abstract:</strong><br>
            In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality -- a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft</title>
      <itunes:episode>673</itunes:episode>
      <podcast:episode>673</podcast:episode>
      <itunes:title>MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f5c7693e-03a3-4bc2-be3d-a59e8d56d074</guid>
      <link>https://share.transistor.fm/s/8350db9b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08388v1">http://arxiv.org/abs/2504.08388v1</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer respectively, we compose the model input as the interleaved concatenation of the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. At inference time, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models at different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. For evaluation, we propose new metrics to assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, significantly outperforming SoTA open-source diffusion-based world models. The code and model have been released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08388v1">http://arxiv.org/abs/2504.08388v1</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer respectively, we compose the model input as the interleaved concatenation of the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. At inference time, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models at different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. For evaluation, we propose new metrics to assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, significantly outperforming SoTA open-source diffusion-based world models. The code and model have been released.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 14 Apr 2025 20:13:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8350db9b/0bccf6c2.mp3" length="18484340" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1152</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.08388v1">http://arxiv.org/abs/2504.08388v1</a></p>

            <p><strong>Abstract:</strong><br>
            World modeling is a crucial task for enabling intelligent agents to effectively interact with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game which has been utilized as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input, and generates consequent new scenes following the actions. Specifically, by transforming visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer respectively, we compose the model input as the interleaved concatenation of the two kinds of ids. The model is then trained with next token prediction to learn rich representations of game states as well as the conditions between states and actions simultaneously. At inference time, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame at the same time, letting models at different scales generate 4 to 7 frames per second and enabling real-time interactions with game players. For evaluation, we propose new metrics to assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation shows the efficacy of MineWorld, significantly outperforming SoTA open-source diffusion-based world models. The code and model have been released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kimi-VL Technical Report</title>
      <itunes:episode>672</itunes:episode>
      <podcast:episode>672</podcast:episode>
      <itunes:title>Kimi-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5ab25fb4-758b-4b0a-b64b-a6eed8409416</guid>
      <link>https://share.transistor.fm/s/de32d0f8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen</p>

            <p><strong>Title:</strong><br>
            Kimi-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07491v1">http://arxiv.org/abs/2504.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen</p>

            <p><strong>Title:</strong><br>
            Kimi-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07491v1">http://arxiv.org/abs/2504.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:12:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de32d0f8/4d7783f1.mp3" length="22197020" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen</p>

            <p><strong>Title:</strong><br>
            Kimi-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07491v1">http://arxiv.org/abs/2504.07491v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing</title>
      <itunes:episode>671</itunes:episode>
      <podcast:episode>671</podcast:episode>
      <itunes:title>C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27820aac-425f-4e9c-b0ff-f977dfea43a4</guid>
      <link>https://share.transistor.fm/s/abe95c4b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07964v1">http://arxiv.org/abs/2504.07964v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways -- our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms merely to the core experts' mixing weights in critical layers, which enjoys similar performance but saves significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and examine it on six widely-used benchmarks. It consistently improves the base model by 7-15% in accuracy and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, hence improving MoE's advantages in efficiency. Our thorough ablation study further offers novel insights into achieving test-time improvement on MoE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07964v1">http://arxiv.org/abs/2504.07964v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways -- our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms merely to the core experts' mixing weights in critical layers, which enjoys similar performance but saves significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and examine it on six widely-used benchmarks. It consistently improves the base model by 7-15% in accuracy and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, hence improving MoE's advantages in efficiency. Our thorough ablation study further offers novel insights into achieving test-time improvement on MoE.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:11:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/abe95c4b/afe4e055.mp3" length="21281348" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 37 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07964v1">http://arxiv.org/abs/2504.07964v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways -- our study reveals that naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms merely to the core experts' mixing weights in critical layers, which enjoys similar performance but saves significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and examine it on six widely-used benchmarks. It consistently improves the base model by 7-15% in accuracy and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, hence improving MoE's advantages in efficiency. Our thorough ablation study further offers novel insights into achieving test-time improvement on MoE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning</title>
      <itunes:episode>670</itunes:episode>
      <podcast:episode>670</podcast:episode>
      <itunes:title>VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5980a2bc-50c4-4f89-a0cd-3af09bc6bd78</guid>
      <link>https://share.transistor.fm/s/43742b04</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07956v1">http://arxiv.org/abs/2504.07956v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than on reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07956v1">http://arxiv.org/abs/2504.07956v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than on reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:11:18 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/43742b04/2ad3b5b8.mp3" length="21413407" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, Feng Zhao</p>

            <p><strong>Title:</strong><br>
            VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07956v1">http://arxiv.org/abs/2504.07956v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Chain-of-Thought (CoT) reasoning has significantly enhanced the capabilities of large language models (LLMs) and large vision-language models (LVLMs). However, a rigorous evaluation framework for video CoT reasoning remains absent. Current video benchmarks fail to adequately assess the reasoning process and expose whether failures stem from deficiencies in perception or reasoning capabilities. Therefore, we introduce VCR-Bench, a novel benchmark designed to comprehensively evaluate LVLMs' Video Chain-of-Thought Reasoning capabilities. VCR-Bench comprises 859 videos spanning a variety of video content and durations, along with 1,034 high-quality question-answer pairs. Each pair is manually annotated with a stepwise CoT rationale, where every step is tagged to indicate its association with perception or reasoning capabilities. Furthermore, we design seven distinct task dimensions and propose the CoT score to assess the entire CoT process based on the stepwise tagged CoT rationales. Extensive experiments on VCR-Bench highlight substantial limitations in current LVLMs. Even the top-performing model, o1, only achieves a 62.8% CoT score and a 56.7% accuracy, while most models score below 40%. Experiments show most models score lower on perception than on reasoning steps, revealing LVLMs' key bottleneck in temporal-spatial information processing for complex video reasoning. A robust positive correlation between the CoT score and accuracy confirms the validity of our evaluation framework and underscores the critical role of CoT reasoning in solving complex video reasoning tasks. We hope VCR-Bench will serve as a standardized evaluation framework and expose the actual drawbacks in complex video reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</title>
      <itunes:episode>669</itunes:episode>
      <podcast:episode>669</podcast:episode>
      <itunes:title>DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1556d511-b415-4bff-abc1-f2c23e5a3583</guid>
      <link>https://share.transistor.fm/s/84187312</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07128v1">http://arxiv.org/abs/2504.07128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07128v1">http://arxiv.org/abs/2504.07128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:10:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/84187312/08cc5867.mp3" length="24763743" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1544</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Siva Reddy</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1 Thoughtology: Let's &lt;think&gt; about LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07128v1">http://arxiv.org/abs/2504.07128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning</title>
      <itunes:episode>668</itunes:episode>
      <podcast:episode>668</podcast:episode>
      <itunes:title>VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85fecfee-0819-4cac-9545-c563eda61b2b</guid>
      <link>https://share.transistor.fm/s/5724cf1b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07960v1">http://arxiv.org/abs/2504.07960v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07960v1">http://arxiv.org/abs/2504.07960v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:10:35 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5724cf1b/edfba29d.mp3" length="19337402" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1205</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhong-Yu Li, Ruoyi Du, Juncheng Yan, Le Zhuo, Zhen Li, Peng Gao, Zhanyu Ma, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07960v1">http://arxiv.org/abs/2504.07960v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MM-IFEngine: Towards Multimodal Instruction Following</title>
      <itunes:episode>667</itunes:episode>
      <podcast:episode>667</podcast:episode>
      <itunes:title>MM-IFEngine: Towards Multimodal Instruction Following</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7642b3fe-bd4f-47a7-bfde-4f24de31540e</guid>
      <link>https://share.transistor.fm/s/067f7579</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            MM-IFEngine: Towards Multimodal Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07957v1">http://arxiv.org/abs/2504.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            MM-IFEngine: Towards Multimodal Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07957v1">http://arxiv.org/abs/2504.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:10:14 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/067f7579/ffc909d1.mp3" length="21360295" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1331</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            MM-IFEngine: Towards Multimodal Instruction Following</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07957v1">http://arxiv.org/abs/2504.07957v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HoloPart: Generative 3D Part Amodal Segmentation</title>
      <itunes:episode>666</itunes:episode>
      <podcast:episode>666</podcast:episode>
      <itunes:title>HoloPart: Generative 3D Part Amodal Segmentation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">390f7307-29a0-45b5-97e5-f1cdc57d1fad</guid>
      <link>https://share.transistor.fm/s/b51e919a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            HoloPart: Generative 3D Part Amodal Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07943v1">http://arxiv.org/abs/2504.07943v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part amodal segmentation (decomposing a 3D shape into complete, semantically meaningful parts, even when occluded) is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            HoloPart: Generative 3D Part Amodal Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07943v1">http://arxiv.org/abs/2504.07943v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part amodal segmentation (decomposing a 3D shape into complete, semantically meaningful parts, even when occluded) is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 11 Apr 2025 21:09:53 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b51e919a/76bfae20.mp3" length="21824225" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1360</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yuan-Chen Guo, Yukun Huang, Zi-Xin Zou, Zhipeng Yu, Yangguang Li, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            HoloPart: Generative 3D Part Amodal Segmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07943v1">http://arxiv.org/abs/2504.07943v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part amodal segmentation (decomposing a 3D shape into complete, semantically meaningful parts, even when occluded) is a challenging but crucial task for 3D content creation and understanding. Existing 3D part segmentation methods only identify visible surface patches, limiting their utility. Inspired by 2D amodal segmentation, we introduce this novel task to the 3D domain and propose a practical, two-stage approach, addressing the key challenges of inferring occluded 3D geometry, maintaining global shape consistency, and handling diverse shapes with limited training data. First, we leverage existing 3D part segmentation to obtain initial, incomplete part segments. Second, we introduce HoloPart, a novel diffusion-based model, to complete these segments into full 3D parts. HoloPart utilizes a specialized architecture with local attention to capture fine-grained part geometry and global shape context attention to ensure overall shape consistency. We introduce new benchmarks based on the ABO and PartObjaverse-Tiny datasets and demonstrate that HoloPart significantly outperforms state-of-the-art shape completion methods. By incorporating HoloPart with existing segmentation techniques, we achieve promising results on 3D part amodal segmentation, opening new avenues for applications in geometry editing, animation, and material assignment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DDT: Decoupled Diffusion Transformer</title>
      <itunes:episode>665</itunes:episode>
      <podcast:episode>665</podcast:episode>
      <itunes:title>DDT: Decoupled Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">49b48831-4bfc-4928-a8e7-ff506393d39f</guid>
      <link>https://share.transistor.fm/s/21154513</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            DDT: Decoupled Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05741v2">http://arxiv.org/abs/2504.05741v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            DDT: Decoupled Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05741v2">http://arxiv.org/abs/2504.05741v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Apr 2025 20:30:39 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/21154513/1ce734bf.mp3" length="19089506" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1189</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang</p>

            <p><strong>Title:</strong><br>
            DDT: Decoupled Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05741v2">http://arxiv.org/abs/2504.05741v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher-frequency component with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and high-frequency decoding. To resolve this challenge, we propose a new Decoupled Diffusion Transformer (DDT), with a decoupled design of a dedicated condition encoder for semantic extraction alongside a specialized velocity decoder. Our experiments reveal that a more substantial encoder yields performance improvements as model size increases. For ImageNet 256×256, our DDT-XL/2 achieves a new state-of-the-art performance of 1.31 FID (nearly 4× faster training convergence compared to previous diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our decoupled architecture enhances inference speed by enabling self-condition sharing between adjacent denoising steps. To minimize performance degradation, we propose a novel statistical dynamic programming approach to identify optimal sharing strategies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</title>
      <itunes:episode>664</itunes:episode>
      <podcast:episode>664</podcast:episode>
      <itunes:title>OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1fc9d9f6-3b87-43b9-8c38-36eabb537519</guid>
      <link>https://share.transistor.fm/s/fda2ac69</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge</p>

            <p><strong>Title:</strong><br>
            OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07096v1">http://arxiv.org/abs/2504.07096v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge</p>

            <p><strong>Title:</strong><br>
            OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07096v1">http://arxiv.org/abs/2504.07096v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Apr 2025 20:30:17 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fda2ac69/e8eadae3.mp3" length="20076349" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1251</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, YenSung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Cheng, Karen Farley, Sruthi Sreeram, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, Jesse Dodge</p>

            <p><strong>Title:</strong><br>
            OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07096v1">http://arxiv.org/abs/2504.07096v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Unified Agentic Framework for Evaluating Conditional Image Generation</title>
      <itunes:episode>663</itunes:episode>
      <podcast:episode>663</podcast:episode>
      <itunes:title>A Unified Agentic Framework for Evaluating Conditional Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d4816c3f-e752-4f73-8f3e-e5a9e5fed256</guid>
      <link>https://share.transistor.fm/s/19c8ff3c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            A Unified Agentic Framework for Evaluating Conditional Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07046v1">http://arxiv.org/abs/2504.07046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            A Unified Agentic Framework for Evaluating Conditional Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07046v1">http://arxiv.org/abs/2504.07046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Apr 2025 20:29:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/19c8ff3c/bde3d317.mp3" length="20314579" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1266</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            A Unified Agentic Framework for Evaluating Conditional Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.07046v1">http://arxiv.org/abs/2504.07046v1</a></p>

            <p><strong>Abstract:</strong><br>
            Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?</title>
      <itunes:episode>662</itunes:episode>
      <podcast:episode>662</podcast:episode>
      <itunes:title>Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6ddd99b6-7eca-4058-a24e-d27370f54b2b</guid>
      <link>https://share.transistor.fm/s/b71f4484</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06514v1">http://arxiv.org/abs/2504.06514v1</a></p>

            <p><strong>Abstract:</strong><br>
            We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures run counter to the "test-time scaling law" but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of reasoning length, overthinking patterns, and the location of critical thinking across different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and shed novel insights into mitigating the problem.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06514v1">http://arxiv.org/abs/2504.06514v1</a></p>

            <p><strong>Abstract:</strong><br>
            We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures run counter to the "test-time scaling law" but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of reasoning length, overthinking patterns, and the location of critical thinking across different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and shed novel insights into mitigating the problem.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 10 Apr 2025 20:29:34 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b71f4484/a3036dd7.mp3" length="22349228" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1393</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenrui Fan, Ming Li, Lichao Sun, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06514v1">http://arxiv.org/abs/2504.06514v1</a></p>

            <p><strong>Abstract:</strong><br>
            We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking. This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking. Such failures run counter to the "test-time scaling law" but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking. Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance in the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw in the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns. To further investigate the reasons behind such failures, we conduct fine-grained analyses of reasoning length, overthinking patterns, and the location of critical thinking across different types of LLMs. Moreover, our extended ablation study reveals that overthinking is contagious through the distillation of reasoning models' responses. These results improve the understanding of overthinking and shed novel insights into mitigating the problem.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniSVG: A Unified Scalable Vector Graphics Generation Model</title>
      <itunes:episode>661</itunes:episode>
      <podcast:episode>661</podcast:episode>
      <itunes:title>OmniSVG: A Unified Scalable Vector Graphics Generation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b1aa272a-be3f-49a8-8540-45b23ec2140b</guid>
      <link>https://share.transistor.fm/s/28d57af5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            OmniSVG: A Unified Scalable Vector Graphics Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06263v1">http://arxiv.org/abs/2504.06263v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structures. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            OmniSVG: A Unified Scalable Vector Graphics Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06263v1">http://arxiv.org/abs/2504.06263v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structures. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:04:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/28d57af5/186a7bc0.mp3" length="20797729" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1296</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang</p>

            <p><strong>Title:</strong><br>
            OmniSVG: A Unified Scalable Vector Graphics Generation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06263v1">http://arxiv.org/abs/2504.06263v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs with huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structures. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hogwild! Inference: Parallel LLM Generation via Concurrent Attention</title>
      <itunes:episode>660</itunes:episode>
      <podcast:episode>660</podcast:episode>
      <itunes:title>Hogwild! Inference: Parallel LLM Generation via Concurrent Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">99dcabd0-6d94-4da1-8661-d14bc3ec61d6</guid>
      <link>https://share.transistor.fm/s/3e0a0ec8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Hogwild! Inference: Parallel LLM Generation via Concurrent Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06261v2">http://arxiv.org/abs/2504.06261v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Hogwild! Inference: Parallel LLM Generation via Concurrent Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06261v2">http://arxiv.org/abs/2504.06261v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:03:56 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3e0a0ec8/ce54b17e.mp3" length="23460973" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1463</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 73 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            Hogwild! Inference: Parallel LLM Generation via Concurrent Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.06261v2">http://arxiv.org/abs/2504.06261v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought</title>
      <itunes:episode>659</itunes:episode>
      <podcast:episode>659</podcast:episode>
      <itunes:title>Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1303cd03-29e0-4782-bba1-2c4487c7b16e</guid>
      <link>https://share.transistor.fm/s/2fdf861e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05599v1">http://arxiv.org/abs/2504.05599v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Skywork R1V, a multimodal reasoning model extending an R1-series Large Language Model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05599v1">http://arxiv.org/abs/2504.05599v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Skywork R1V, a multimodal reasoning model extending an R1-series Large Language Model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:03:32 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2fdf861e/14f1663d.mp3" length="22247217" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou</p>

            <p><strong>Title:</strong><br>
            Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05599v1">http://arxiv.org/abs/2504.05599v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Skywork R1V, a multimodal reasoning model extending an R1-series Large Language Model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby enhancing inference efficiency and preventing excessive overthinking. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>An Empirical Study of GPT-4o Image Generation Capabilities</title>
      <itunes:episode>658</itunes:episode>
      <podcast:episode>658</podcast:episode>
      <itunes:title>An Empirical Study of GPT-4o Image Generation Capabilities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b398e506-c886-4999-9972-a40fffa670b9</guid>
      <link>https://share.transistor.fm/s/ada3583a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of GPT-4o Image Generation Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05979v1">http://arxiv.org/abs/2504.05979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, yet their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of GPT-4o Image Generation Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05979v1">http://arxiv.org/abs/2504.05979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, yet their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:03:09 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ada3583a/e57363e0.mp3" length="21613165" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 50 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sixiang Chen, Jinbin Bai, Zhuoran Zhao, Tian Ye, Qingyu Shi, Donghao Zhou, Wenhao Chai, Xin Lin, Jianzong Wu, Chao Tang, Shilin Xu, Tao Zhang, Haobo Yuan, Yikang Zhou, Wei Chow, Linfeng Li, Xiangtai Li, Lei Zhu, Lu Qi</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of GPT-4o Image Generation Capabilities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05979v1">http://arxiv.org/abs/2504.05979v1</a></p>

            <p><strong>Abstract:</strong><br>
            The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, yet their architectural design remains mysterious and unpublished. This prompts the question of whether image and text generation have already been successfully integrated into a unified framework for those methods. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories, including text-to-image, image-to-image, image-to-3D, and image-to-X generation, with more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings, and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values</title>
      <itunes:episode>657</itunes:episode>
      <podcast:episode>657</podcast:episode>
      <itunes:title>COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a523bb59-fe66-491f-94c9-2d994db8e346</guid>
      <link>https://share.transistor.fm/s/c4aee2be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05535v1">http://arxiv.org/abs/2504.05535v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on this pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05535v1">http://arxiv.org/abs/2504.05535v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on this pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:02:46 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c4aee2be/05396e12.mp3" length="20890135" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin</p>

            <p><strong>Title:</strong><br>
            COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05535v1">http://arxiv.org/abs/2504.05535v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on this pipeline, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench (Liu et al., 2024) show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our code and data are released at https://github.com/multimodal-art-projection/COIG-P.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Less-to-More Generalization: Unlocking More Controllability by In-Context Generation</title>
      <itunes:episode>656</itunes:episode>
      <podcast:episode>656</podcast:episode>
      <itunes:title>Less-to-More Generalization: Unlocking More Controllability by In-Context Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d15fea12-e46c-4506-9eda-09b5526dc9c0</guid>
      <link>https://share.transistor.fm/s/428ef881</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            Less-to-More Generalization: Unlocking More Controllability by In-Context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02160v1">http://arxiv.org/abs/2504.02160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            Less-to-More Generalization: Unlocking More Controllability by In-Context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02160v1">http://arxiv.org/abs/2504.02160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 09 Apr 2025 21:02:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/428ef881/9d8b37cf.mp3" length="20408633" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, Qian He</p>

            <p><strong>Title:</strong><br>
            Less-to-More Generalization: Unlocking More Controllability by In-Context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02160v1">http://arxiv.org/abs/2504.02160v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still has challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, which consists of progressive cross-modal alignment and universal rotary position embedding. It is a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmolVLM: Redefining small and efficient multimodal models</title>
      <itunes:episode>655</itunes:episode>
      <podcast:episode>655</podcast:episode>
      <itunes:title>SmolVLM: Redefining small and efficient multimodal models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d8e1fe0d-018b-4ec5-8045-c07a7cf7690c</guid>
      <link>https://share.transistor.fm/s/d375d2af</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolVLM: Redefining small and efficient multimodal models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05299v1">http://arxiv.org/abs/2504.05299v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications.   We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.   Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities.   Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolVLM: Redefining small and efficient multimodal models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05299v1">http://arxiv.org/abs/2504.05299v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications.   We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.   Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities.   Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Apr 2025 20:41:05 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d375d2af/28b4e345.mp3" length="24670536" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1538</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 96 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolVLM: Redefining small and efficient multimodal models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05299v1">http://arxiv.org/abs/2504.05299v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications.   We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.   Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities.   Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>One-Minute Video Generation with Test-Time Training</title>
      <itunes:episode>654</itunes:episode>
      <podcast:episode>654</podcast:episode>
      <itunes:title>One-Minute Video Generation with Test-Time Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b6231a12-baf6-4e41-9f32-702665725585</guid>
      <link>https://share.transistor.fm/s/1d83cffa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang</p>

            <p><strong>Title:</strong><br>
            One-Minute Video Generation with Test-Time Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05298v1">http://arxiv.org/abs/2504.05298v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks and are therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang</p>

            <p><strong>Title:</strong><br>
            One-Minute Video Generation with Test-Time Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05298v1">http://arxiv.org/abs/2504.05298v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks and are therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Apr 2025 20:40:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1d83cffa/2c102478.mp3" length="18121109" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1129</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang</p>

            <p><strong>Title:</strong><br>
            One-Minute Video Generation with Test-Time Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05298v1">http://arxiv.org/abs/2504.05298v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks and are therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Reflection in Pre-Training</title>
      <itunes:episode>653</itunes:episode>
      <podcast:episode>653</podcast:episode>
      <itunes:title>Rethinking Reflection in Pre-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85be5f7c-e512-4b15-a000-099b836d370d</guid>
      <link>https://share.transistor.fm/s/7a2c470e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski</p>

            <p><strong>Title:</strong><br>
            Rethinking Reflection in Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04022v1">http://arxiv.org/abs/2504.04022v1</a></p>

            <p><strong>Abstract:</strong><br>
            A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski</p>

            <p><strong>Title:</strong><br>
            Rethinking Reflection in Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04022v1">http://arxiv.org/abs/2504.04022v1</a></p>

            <p><strong>Abstract:</strong><br>
            A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Apr 2025 20:40:21 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7a2c470e/6d9563d5.mp3" length="20836576" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Essential AI, :, Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Tim Romanski</p>

            <p><strong>Title:</strong><br>
            Rethinking Reflection in Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04022v1">http://arxiv.org/abs/2504.04022v1</a></p>

            <p><strong>Abstract:</strong><br>
            A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>URECA: Unique Region Caption Anything</title>
      <itunes:episode>652</itunes:episode>
      <podcast:episode>652</podcast:episode>
      <itunes:title>URECA: Unique Region Caption Anything</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bbf630ae-5b72-474f-84ab-72db29a988d8</guid>
      <link>https://share.transistor.fm/s/e86bc9de</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            URECA: Unique Region Caption Anything</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05305v1">http://arxiv.org/abs/2504.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple levels of granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            URECA: Unique Region Caption Anything</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05305v1">http://arxiv.org/abs/2504.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple levels of granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Apr 2025 20:39:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e86bc9de/8d97ae32.mp3" length="20875864" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            URECA: Unique Region Caption Anything</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.05305v1">http://arxiv.org/abs/2504.05305v1</a></p>

            <p><strong>Abstract:</strong><br>
            Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple levels of granularity, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models</title>
      <itunes:episode>651</itunes:episode>
      <podcast:episode>651</podcast:episode>
      <itunes:title>T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">00182d36-bbb5-401d-aa6a-adf95005d790</guid>
      <link>https://share.transistor.fm/s/77011e69</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Jaewoong Cho</p>

            <p><strong>Title:</strong><br>
            T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04718v1">http://arxiv.org/abs/2504.04718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Jaewoong Cho</p>

            <p><strong>Title:</strong><br>
            T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04718v1">http://arxiv.org/abs/2504.04718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 08 Apr 2025 20:39:37 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77011e69/c1208d62.mp3" length="19974799" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1245</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minki Kang, Jongwon Jeong, Jaewoong Cho</p>

            <p><strong>Title:</strong><br>
            T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.04718v1">http://arxiv.org/abs/2504.04718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving</title>
      <itunes:episode>650</itunes:episode>
      <podcast:episode>650</podcast:episode>
      <itunes:title>Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e70a3577-85c1-4a25-8e78-04e816071067</guid>
      <link>https://share.transistor.fm/s/2fbe9fd2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang</p>

            <p><strong>Title:</strong><br>
            Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02605v1">http://arxiv.org/abs/2504.02605v1</a></p>

            <p><strong>Abstract:</strong><br>
            The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang</p>

            <p><strong>Title:</strong><br>
            Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02605v1">http://arxiv.org/abs/2504.02605v1</a></p>

            <p><strong>Abstract:</strong><br>
            The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 07 Apr 2025 19:52:04 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2fbe9fd2/15b1bd70.mp3" length="24762909" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1544</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, Liang Xiang</p>

            <p><strong>Title:</strong><br>
            Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02605v1">http://arxiv.org/abs/2504.02605v1</a></p>

            <p><strong>Abstract:</strong><br>
            The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems</title>
      <itunes:episode>649</itunes:episode>
      <podcast:episode>649</podcast:episode>
      <itunes:title>Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f52475ef-57a0-46d8-b146-14869f8a97c7</guid>
      <link>https://share.transistor.fm/s/39ee298d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01990v1">http://arxiv.org/abs/2504.01990v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01990v1">http://arxiv.org/abs/2504.01990v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:44:48 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39ee298d/cc5018df.mp3" length="20009107" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1247</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 98 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Chenglin Wu</p>

            <p><strong>Title:</strong><br>
            Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01990v1">http://arxiv.org/abs/2504.01990v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This survey provides a comprehensive overview, framing intelligent agents within a modular, brain-inspired architecture that integrates principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we delve into the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities, and elucidating core components such as memory, world modeling, reward processing, and emotion-like systems. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms, including emerging AutoML and LLM-driven optimization strategies. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe, secure, and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing</title>
      <itunes:episode>648</itunes:episode>
      <podcast:episode>648</podcast:episode>
      <itunes:title>Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3f191c12-c008-4f52-a67d-e32894db970d</guid>
      <link>https://share.transistor.fm/s/b78b0fe6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan</p>

            <p><strong>Title:</strong><br>
            Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02826v1">http://arxiv.org/abs/2504.02826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan</p>

            <p><strong>Title:</strong><br>
            Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02826v1">http://arxiv.org/abs/2504.02826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:44:26 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b78b0fe6/da289691.mp3" length="22207940" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 55 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan</p>

            <p><strong>Title:</strong><br>
            Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02826v1">http://arxiv.org/abs/2504.02826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance consistency, and supporting flexible input formats. To address this gap, we introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We curate high-quality test cases for each category and propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach. Our experiments reveal that while GPT-4o-Native significantly outperforms other open-source and proprietary models, even this state-of-the-art system struggles with logical reasoning tasks, highlighting an area that remains underexplored. As an initial effort, RISEBench aims to provide foundational insights into reasoning-aware visual editing and to catalyze future research. Though still in its early stages, we are committed to continuously expanding and refining the benchmark to support more comprehensive, reliable, and scalable evaluations of next-generation multimodal systems. Our code and data will be released at https://github.com/PhoenixZ810/RISEBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ZClip: Adaptive Spike Mitigation for LLM Pre-Training</title>
      <itunes:episode>647</itunes:episode>
      <podcast:episode>647</podcast:episode>
      <itunes:title>ZClip: Adaptive Spike Mitigation for LLM Pre-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4af6e993-6cca-4b77-8aaf-567265c448ee</guid>
      <link>https://share.transistor.fm/s/cb449e17</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra</p>

            <p><strong>Title:</strong><br>
            ZClip: Adaptive Spike Mitigation for LLM Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02507v1">http://arxiv.org/abs/2504.02507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra</p>

            <p><strong>Title:</strong><br>
            ZClip: Adaptive Spike Mitigation for LLM Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02507v1">http://arxiv.org/abs/2504.02507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:44:03 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cb449e17/d898516e.mp3" length="19543009" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1218</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra</p>

            <p><strong>Title:</strong><br>
            ZClip: Adaptive Spike Mitigation for LLM Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02507v1">http://arxiv.org/abs/2504.02507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation</title>
      <itunes:episode>646</itunes:episode>
      <podcast:episode>646</podcast:episode>
      <itunes:title>GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c344f13f-e125-4764-af34-2a946a079613</guid>
      <link>https://share.transistor.fm/s/0ea52819</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan</p>

            <p><strong>Title:</strong><br>
            GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02782v1">http://arxiv.org/abs/2504.02782v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) backbone combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan</p>

            <p><strong>Title:</strong><br>
            GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02782v1">http://arxiv.org/abs/2504.02782v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) backbone combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:43:42 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0ea52819/2cd7cfb4.mp3" length="20776014" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1295</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan</p>

            <p><strong>Title:</strong><br>
            GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02782v1">http://arxiv.org/abs/2504.02782v1</a></p>

            <p><strong>Abstract:</strong><br>
            The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community. This technical report presents the first-look evaluation benchmark (named GPT-ImgEval), quantitatively and qualitatively diagnosing GPT-4o's performance across three critical dimensions: (1) generation quality, (2) editing proficiency, and (3) world knowledge-informed semantic synthesis. Across all three tasks, GPT-4o demonstrates strong performance, significantly surpassing existing methods in both image generation control and output quality, while also showcasing exceptional knowledge reasoning capabilities. Furthermore, based on GPT-4o's generated data, we propose a classification-model-based approach to investigate the underlying architecture of GPT-4o, where our empirical results suggest the model consists of an auto-regressive (AR) backbone combined with a diffusion-based head for image decoding, rather than a VAR-like architecture. We also provide a complete speculation on GPT-4o's overall architecture. In addition, we conduct a series of analyses to identify and visualize GPT-4o's specific limitations and the synthetic artifacts commonly observed in its image generation. We also present a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discuss the safety implications of GPT-4o's outputs, particularly their detectability by existing image forensic models. We hope that our work can offer valuable insight and provide a reliable benchmark to guide future research, foster reproducibility, and accelerate innovation in the field of image generation and beyond. The codes and datasets used for evaluating GPT-4o can be found at https://github.com/PicoTrex/GPT-ImgEval.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme</title>
      <itunes:episode>645</itunes:episode>
      <podcast:episode>645</podcast:episode>
      <itunes:title>Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f2c81ee0-5be3-4a4a-9a05-8640d6a9b003</guid>
      <link>https://share.transistor.fm/s/7f1b0a83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02587v1">http://arxiv.org/abs/2504.02587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02587v1">http://arxiv.org/abs/2504.02587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:43:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f1b0a83/c88153b5.mp3" length="21729423" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.LG, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.02587v1">http://arxiv.org/abs/2504.02587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WikiVideo: Article Generation from Multiple Videos</title>
      <itunes:episode>644</itunes:episode>
      <podcast:episode>644</podcast:episode>
      <itunes:title>WikiVideo: Article Generation from Multiple Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a2e9e663-2221-4e74-939a-eb37e1f75a51</guid>
      <link>https://share.transistor.fm/s/8157d9d0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            WikiVideo: Article Generation from Multiple Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00939v1">http://arxiv.org/abs/2504.00939v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            WikiVideo: Article Generation from Multiple Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00939v1">http://arxiv.org/abs/2504.00939v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 04 Apr 2025 20:42:58 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8157d9d0/724563a2.mp3" length="20724994" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            WikiVideo: Article Generation from Multiple Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00939v1">http://arxiv.org/abs/2504.00939v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization</title>
      <itunes:episode>643</itunes:episode>
      <podcast:episode>643</podcast:episode>
      <itunes:title>MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">71a75c6b-be8c-4354-be7b-919b047f4885</guid>
      <link>https://share.transistor.fm/s/e2594a32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei</p>

            <p><strong>Title:</strong><br>
            MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00999v1">http://arxiv.org/abs/2504.00999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei</p>

            <p><strong>Title:</strong><br>
            MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00999v1">http://arxiv.org/abs/2504.00999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:55:00 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e2594a32/e3b7938d.mp3" length="19138489" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1192</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 57 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei</p>

            <p><strong>Title:</strong><br>
            MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.00999v1">http://arxiv.org/abs/2504.00999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from latent space with the token merge module after self-attention blocks in the encoder for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers their fine-grained details through cross-attention in the decoder for reconstruction. As for the second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ as an AR generative model achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction</title>
      <itunes:episode>642</itunes:episode>
      <podcast:episode>642</podcast:episode>
      <itunes:title>AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">db2a1ec0-02cf-4bf7-b821-a0224ef3134d</guid>
      <link>https://share.transistor.fm/s/de9933b5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan</p>

            <p><strong>Title:</strong><br>
            AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01014v1">http://arxiv.org/abs/2504.01014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image and video synthesis have opened up new possibilities in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters, simulating life through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, allowing players to interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at https://github.com/TencentARC/AnimeGamer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan</p>

            <p><strong>Title:</strong><br>
            AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01014v1">http://arxiv.org/abs/2504.01014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image and video synthesis have opened up new possibilities in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters, simulating life through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, allowing players to interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at https://github.com/TencentARC/AnimeGamer.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:54:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de9933b5/440d8b1c.mp3" length="22513465" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Cheng, Yuying Ge, Yixiao Ge, Jing Liao, Ying Shan</p>

            <p><strong>Title:</strong><br>
            AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01014v1">http://arxiv.org/abs/2504.01014v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in image and video synthesis have opened up new possibilities in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters, simulating life through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, allowing players to interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs large language models (LLMs) to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at https://github.com/TencentARC/AnimeGamer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Understanding R1-Zero-Like Training: A Critical Perspective</title>
      <itunes:episode>641</itunes:episode>
      <podcast:episode>641</podcast:episode>
      <itunes:title>Understanding R1-Zero-Like Training: A Critical Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">37657df8-e235-4934-b468-d3a36f8fcbcc</guid>
      <link>https://share.transistor.fm/s/354e1f59</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Understanding R1-Zero-Like Training: A Critical Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.20783v1">http://arxiv.org/abs/2503.20783v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Understanding R1-Zero-Like Training: A Critical Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.20783v1">http://arxiv.org/abs/2503.20783v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:54:07 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/354e1f59/e3a92d25.mp3" length="19185659" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1195</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Understanding R1-Zero-Like Training: A Critical Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.20783v1">http://arxiv.org/abs/2503.20783v1</a></p>

            <p><strong>Abstract:</strong><br>
            DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards Physically Plausible Video Generation via VLM Planning</title>
      <itunes:episode>640</itunes:episode>
      <podcast:episode>640</podcast:episode>
      <itunes:title>Towards Physically Plausible Video Generation via VLM Planning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">90dd3282-11b2-40f4-9846-d1467dbf29ee</guid>
      <link>https://share.transistor.fm/s/2a3e9fda</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia</p>

            <p><strong>Title:</strong><br>
            Towards Physically Plausible Video Generation via VLM Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.23368v2">http://arxiv.org/abs/2503.23368v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia</p>

            <p><strong>Title:</strong><br>
            Towards Physically Plausible Video Generation via VLM Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.23368v2">http://arxiv.org/abs/2503.23368v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:53:45 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2a3e9fda/deca6d19.mp3" length="21586838" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1345</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia</p>

            <p><strong>Title:</strong><br>
            Towards Physically Plausible Video Generation via VLM Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.23368v2">http://arxiv.org/abs/2503.23368v2</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the community's attention to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. Because the predicted motion trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: https://madaoer.github.io/projects/physically_plausible_video_generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance</title>
      <itunes:episode>639</itunes:episode>
      <podcast:episode>639</podcast:episode>
      <itunes:title>DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55e48d0c-c561-41d8-a084-3b6d8b834ec6</guid>
      <link>https://share.transistor.fm/s/95dba394</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu</p>

            <p><strong>Title:</strong><br>
            DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01724v2">http://arxiv.org/abs/2504.01724v2</a></p>

            <p><strong>Abstract:</strong><br>
            While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, limiting their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals, which integrate implicit facial representations, 3D head spheres, and 3D body skeletons, achieve robust control of facial expressions and body movements while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu</p>

            <p><strong>Title:</strong><br>
            DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01724v2">http://arxiv.org/abs/2504.01724v2</a></p>

            <p><strong>Abstract:</strong><br>
            While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, limiting their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals, which integrate implicit facial representations, 3D head spheres, and 3D body skeletons, achieve robust control of facial expressions and body movements while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:53:23 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95dba394/77198ad3.mp3" length="20140308" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu</p>

            <p><strong>Title:</strong><br>
            DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01724v2">http://arxiv.org/abs/2504.01724v2</a></p>

            <p><strong>Abstract:</strong><br>
            While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, limiting their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals, which integrate implicit facial representations, 3D head spheres, and 3D body skeletons, achieve robust control of facial expressions and body movements while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step</title>
      <itunes:episode>638</itunes:episode>
      <podcast:episode>638</podcast:episode>
      <itunes:title>VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8ee66ca8-abbb-4980-b974-d11e5e4c7421</guid>
      <link>https://share.transistor.fm/s/637d9d47</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01956v2">http://arxiv.org/abs/2504.01956v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D scenes from sparse views is a challenging task because it is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, pioneering research has started to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these methods are limited by slow inference time and a lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01956v2">http://arxiv.org/abs/2504.01956v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D scenes from sparse views is a challenging task because it is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, pioneering research has started to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these methods are limited by slow inference time and a lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 03 Apr 2025 20:52:59 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/637d9d47/f5f746ee.mp3" length="21002965" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1309</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan</p>

            <p><strong>Title:</strong><br>
            VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2504.01956v2">http://arxiv.org/abs/2504.01956v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering 3D scenes from sparse views is a challenging task because it is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, pioneering research has started to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these methods are limited by slow inference time and a lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: https://hanyang-21.github.io/VideoScene</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>START: Self-taught Reasoner with Tools</title>
      <itunes:episode>637</itunes:episode>
      <podcast:episode>637</podcast:episode>
      <itunes:title>START: Self-taught Reasoner with Tools</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7157ed5c-f9c6-4e97-af32-9872be9dbc68</guid>
      <link>https://share.transistor.fm/s/adc4623d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu</p>

            <p><strong>Title:</strong><br>
            START: Self-taught Reasoner with Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04625v1">http://arxiv.org/abs/2503.04625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu</p>

            <p><strong>Title:</strong><br>
            START: Self-taught Reasoner with Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04625v1">http://arxiv.org/abs/2503.04625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Mar 2025 19:15:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/adc4623d/87e1868c.mp3" length="23788623" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1483</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu</p>

            <p><strong>Title:</strong><br>
            START: Self-taught Reasoner with Tools</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04625v1">http://arxiv.org/abs/2503.04625v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Token-Efficient Long Video Understanding for Multimodal LLMs</title>
      <itunes:episode>636</itunes:episode>
      <podcast:episode>636</podcast:episode>
      <itunes:title>Token-Efficient Long Video Understanding for Multimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">762590cd-f34c-469d-8579-c79a3a1b6d9f</guid>
      <link>https://share.transistor.fm/s/422e6747</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon</p>

            <p><strong>Title:</strong><br>
            Token-Efficient Long Video Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04130v1">http://arxiv.org/abs/2503.04130v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for a fixed number of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon</p>

            <p><strong>Title:</strong><br>
            Token-Efficient Long Video Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04130v1">http://arxiv.org/abs/2503.04130v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for a fixed number of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Mar 2025 19:15:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/422e6747/f652ec45.mp3" length="20510173" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1278</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon</p>

            <p><strong>Title:</strong><br>
            Token-Efficient Long Video Understanding for Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04130v1">http://arxiv.org/abs/2503.04130v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for a fixed number of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM</title>
      <itunes:episode>635</itunes:episode>
      <podcast:episode>635</podcast:episode>
      <itunes:title>LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">73d367a3-831e-434b-9620-7cbe790d2172</guid>
      <link>https://share.transistor.fm/s/7d479950</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04724v1">http://arxiv.org/abs/2503.04724v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04724v1">http://arxiv.org/abs/2503.04724v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Mar 2025 19:14:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d479950/7e488f86.mp3" length="25284108" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1577</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.04724v1">http://arxiv.org/abs/2503.04724v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EgoLife: Towards Egocentric Life Assistant</title>
      <itunes:episode>634</itunes:episode>
      <podcast:episode>634</podcast:episode>
      <itunes:title>EgoLife: Towards Egocentric Life Assistant</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e6dd115a-dece-4d69-ab5e-71d9a77d6b73</guid>
      <link>https://share.transistor.fm/s/b8d886c2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            EgoLife: Towards Egocentric Life Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03803v1">http://arxiv.org/abs/2503.03803v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoLife, a project to develop an egocentric life assistant that accompanies users and enhances their personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            EgoLife: Towards Egocentric Life Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03803v1">http://arxiv.org/abs/2503.03803v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoLife, a project to develop an egocentric life assistant that accompanies users and enhances their personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Mar 2025 19:14:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8d886c2/dda164d9.mp3" length="21309293" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            EgoLife: Towards Egocentric Life Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03803v1">http://arxiv.org/abs/2503.03803v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EgoLife, a project to develop an egocentric life assistant that accompanies users and enhances their personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers</title>
      <itunes:episode>633</itunes:episode>
      <podcast:episode>633</podcast:episode>
      <itunes:title>Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12adbf3d-630d-457c-a002-a490473f6319</guid>
      <link>https://share.transistor.fm/s/7bb18771</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang</p>

            <p><strong>Title:</strong><br>
            Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.00865v1">http://arxiv.org/abs/2503.00865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce $\texttt{Babel}$, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: $\texttt{Babel-9B}$, designed for efficient inference and fine-tuning, and $\texttt{Babel-83B}$, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang</p>

            <p><strong>Title:</strong><br>
            Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.00865v1">http://arxiv.org/abs/2503.00865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce $\texttt{Babel}$, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: $\texttt{Babel-9B}$, designed for efficient inference and fine-tuning, and $\texttt{Babel-83B}$, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.</p>
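
            <p><strong>Illustrative sketch (not from the paper):</strong> one generic way a "layer extension" can raise a transformer's parameter count: duplicate existing blocks and insert the copies before continued training. The toy blocks, insertion pattern, and initialization below are assumptions for illustration, not Babel's actual recipe.</p>

            <pre><code># Generic depth-extension sketch with toy blocks (assumes PyTorch is installed).
# Babel's actual layer-extension details are in the paper; this only shows the idea.
import copy
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.ff = nn.Linear(d, d)
    def forward(self, x):
        return x + self.ff(x)  # residual feed-forward stand-in for a transformer block

def extend_layers(layers, insert_every=2):
    """Duplicate every `insert_every`-th block and insert the copy right after it."""
    extended = []
    for i, block in enumerate(layers):
        extended.append(block)
        if (i + 1) % insert_every == 0:
            extended.append(copy.deepcopy(block))  # new trainable parameters
    return nn.ModuleList(extended)

base = nn.ModuleList(ToyBlock() for _ in range(6))
bigger = extend_layers(base)
print(len(base), len(bigger))  # 6 and 9 blocks

x = torch.randn(2, 16)
for block in bigger:
    x = block(x)  # the extended stack is still a drop-in forward pass
print(x.shape)</code></pre>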
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Mar 2025 19:15:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7bb18771/cf211876.mp3" length="17487096" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1089</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang</p>

            <p><strong>Title:</strong><br>
            Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.00865v1">http://arxiv.org/abs/2503.00865v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, with existing models often limited in language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce $\texttt{Babel}$, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued pretraining approaches, Babel expands its parameter count through a layer extension technique that elevates Babel's performance ceiling. We introduce two variants: $\texttt{Babel-9B}$, designed for efficient inference and fine-tuning, and $\texttt{Babel-83B}$, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate its superior performance compared to open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-sized LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching the same level as commercial models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs</title>
      <itunes:episode>632</itunes:episode>
      <podcast:episode>632</podcast:episode>
      <itunes:title>HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c65c4f91-6022-4e43-9076-c86b5435fa29</guid>
      <link>https://share.transistor.fm/s/718c33c7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.02003v2">http://arxiv.org/abs/2503.02003v2</a></p>

            <p><strong>Abstract:</strong><br>
            An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixing factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.02003v2">http://arxiv.org/abs/2503.02003v2</a></p>

            <p><strong>Abstract:</strong><br>
            An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixing factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.</p>
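
            <p><strong>Illustrative sketch (not from the paper):</strong> the flavour of a highlighted-prompt setup, where the question is first rewritten with XML tags around key facts and the answer is asked to reuse those tags. The tag names and instruction wording below are made up for illustration, not the paper's actual prompts.</p>

            <pre><code># Toy example of a HoT-style prompt: tag key facts in the question, then ask the
# model to reuse the same tags when it answers. Wording is illustrative only.
question = "A bakery sold 24 cupcakes in the morning and 18 in the afternoon. How many in total?"

reformatted = (
    "A bakery sold &lt;fact1&gt;24 cupcakes in the morning&lt;/fact1&gt; and "
    "&lt;fact2&gt;18 in the afternoon&lt;/fact2&gt;. How many in total?"
)

hot_prompt = (
    "First re-format the question by wrapping key facts in XML tags, "
    "then answer while reusing those tags to highlight the facts you rely on.\n\n"
    f"Question: {question}\n"
    f"Reformatted question: {reformatted}\n"
    "Answer: The bakery sold &lt;fact1&gt;24 cupcakes&lt;/fact1&gt; plus "
    "&lt;fact2&gt;18 cupcakes&lt;/fact2&gt;, so 24 + 18 = 42 cupcakes in total."
)
print(hot_prompt)</code></pre>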
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Mar 2025 19:15:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/718c33c7/aa8b28d5.mp3" length="23497345" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1465</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen</p>

            <p><strong>Title:</strong><br>
            HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.02003v2">http://arxiv.org/abs/2503.02003v2</a></p>

            <p><strong>Abstract:</strong><br>
            An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response mixing factual and non-factual statements poses a challenge for humans to verify and accurately base their decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoTs tend to make users believe that an answer is correct.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Process-based Self-Rewarding Language Models</title>
      <itunes:episode>631</itunes:episode>
      <podcast:episode>631</podcast:episode>
      <itunes:title>Process-based Self-Rewarding Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b1437d18-dd90-431a-9d0c-af5d5419b0f5</guid>
      <link>https://share.transistor.fm/s/7f498fe9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong</p>

            <p><strong>Title:</strong><br>
            Process-based Self-Rewarding Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03746v1">http://arxiv.org/abs/2503.03746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, but this is constrained by the upper limit of human performance. Therefore, the Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong</p>

            <p><strong>Title:</strong><br>
            Process-based Self-Rewarding Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03746v1">http://arxiv.org/abs/2503.03746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, but this is constrained by the upper limit of human performance. Therefore, the Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.</p>
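
            <p><strong>Illustrative sketch (not from the paper):</strong> the rough shape of one step-wise self-rewarding pass implied by the abstract: sample candidate next reasoning steps, score them with the model acting as its own judge, keep (chosen, rejected) step pairs for preference optimization, and continue from the best step. Every function below is a stand-in, not the authors' pipeline.</p>

            <pre><code># Toy loop for collecting step-wise preference pairs via self-judging.
# The sampling and judging functions are random stand-ins for the real LLM calls.
import random

def sample_steps(prefix, k=4):
    # Stand-in for the policy proposing k candidate next reasoning steps.
    return [f"{prefix} | candidate step {i}" for i in range(k)]

def judge_step(candidate):
    # Stand-in for step-wise LLM-as-a-Judge scoring of a single step.
    return random.random()

def collect_stepwise_preferences(problem, max_steps=3):
    prefix, pairs = problem, []
    for _ in range(max_steps):
        ranked = sorted(sample_steps(prefix), key=judge_step, reverse=True)
        pairs.append((ranked[0], ranked[-1]))  # (chosen step, rejected step)
        prefix = ranked[0]                     # continue reasoning from the best step
    return pairs

for chosen, rejected in collect_stepwise_preferences("Show that the sum of two even numbers is even."):
    print("chosen:  ", chosen)
    print("rejected:", rejected)
# In the full pipeline these step-wise pairs would drive preference optimization
# (e.g., a DPO-style update), and the whole procedure repeats iteratively.</code></pre>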
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Mar 2025 19:14:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f498fe9/22ee4a58.mp3" length="22882075" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1426</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shimao Zhang, Xiao Liu, Xin Zhang, Junxiao Liu, Zheheng Luo, Shujian Huang, Yeyun Gong</p>

            <p><strong>Title:</strong><br>
            Process-based Self-Rewarding Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.03746v1">http://arxiv.org/abs/2503.03746v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs' performance, but this is constrained by the upper limit of human performance. Therefore, the Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of self-rewarding to achieve LLM reasoning that may surpass human capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Visual-RFT: Visual Reinforcement Fine-Tuning</title>
      <itunes:episode>630</itunes:episode>
      <podcast:episode>630</podcast:episode>
      <itunes:title>Visual-RFT: Visual Reinforcement Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5db696ac-0798-422f-bdb8-abc67994a5b7</guid>
      <link>https://share.transistor.fm/s/66d3d6f2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual-RFT: Visual Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01785v1">http://arxiv.org/abs/2503.01785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual-RFT: Visual Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01785v1">http://arxiv.org/abs/2503.01785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.</p>
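
            <p><strong>Illustrative sketch (not from the paper):</strong> a minimal IoU-based verifiable reward of the kind the abstract mentions for object detection. The box format, class check, and absence of any format bonus are simplifying assumptions; the paper's exact reward definitions may differ.</p>

            <pre><code># Minimal IoU-style verifiable reward for a predicted box; boxes are (x1, y1, x2, y2).
# A simplified stand-in for the perception rewards described in the abstract.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, pred_label, gt_box, gt_label):
    # Verifiable reward: IoU with the ground-truth box if the class is right, else 0.
    return iou(pred_box, gt_box) if pred_label == gt_label else 0.0

print(detection_reward((10, 10, 50, 50), "cat", (12, 12, 48, 52), "cat"))  # ~0.82
# Such per-response scalar rewards are what a GRPO-style policy-optimization
# step would consume.</code></pre>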
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Mar 2025 19:13:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/66d3d6f2/43496523.mp3" length="21969671" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Visual-RFT: Visual Reinforcement Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01785v1">http://arxiv.org/abs/2503.01785v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by $24.3\%$ over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by $21.9$ on COCO's two-shot setting and $15.4$ on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</title>
      <itunes:episode>629</itunes:episode>
      <podcast:episode>629</podcast:episode>
      <itunes:title>Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b809f54e-5164-496b-b92b-c147d56768f1</guid>
      <link>https://share.transistor.fm/s/b03d18d5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01743v1">http://arxiv.org/abs/2503.01743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first on the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01743v1">http://arxiv.org/abs/2503.01743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first on the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.</p>
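
            <p><strong>Illustrative sketch (not from the paper):</strong> the general "mixture of LoRAs" idea the abstract describes: a frozen base projection plus per-modality low-rank adapters chosen by a simple router, so each modality gets its own update without touching the base weights. Dimensions, routing, and module layout below are toy assumptions, not Phi-4-Multimodal's architecture.</p>

            <pre><code># Toy mixture-of-LoRAs layer (assumes PyTorch is installed): a frozen base linear
# layer plus one low-rank adapter per modality, selected by name at call time.
import torch
import torch.nn as nn

class LoRA(nn.Module):
    def __init__(self, d, r=4):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)
        self.up = nn.Linear(r, d, bias=False)
    def forward(self, x):
        return self.up(self.down(x))

class MixtureOfLoRAsLinear(nn.Module):
    def __init__(self, d=32, modalities=("text", "vision", "speech")):
        super().__init__()
        self.base = nn.Linear(d, d)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base model weights stay frozen
        self.adapters = nn.ModuleDict({m: LoRA(d) for m in modalities})
    def forward(self, x, modality):
        # Route to the adapter that matches the input modality.
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAsLinear()
x = torch.randn(2, 32)
print(layer(x, "speech").shape)  # torch.Size([2, 32])</code></pre>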
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Mar 2025 19:13:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b03d18d5/7a07f932.mp3" length="24768796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1544</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Nguyen, Jing Pan, Daniel Perez-Becker, Jacob Platin, Thomas Portet, Kai Qiu, Bo Ren, Liliang Ren, Sambuddha Roy, Ning Shang, Yelong Shen, Saksham Singhal, Subhojit Som, Xia Song, Tetyana Sych, Praneetha Vaddamanu, Shuohang Wang, Yiming Wang, Zhenghao Wang, Haibin Wu, Haoran Xu, Weijian Xu, Yifan Yang, Ziyi Yang, Donghan Yu, Ishmam Zabir, Jianwen Zhang, Li Lyna Zhang, Yunan Zhang, Xiren Zhou</p>

            <p><strong>Title:</strong><br>
            Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01743v1">http://arxiv.org/abs/2503.01743v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first on the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models</title>
      <itunes:episode>628</itunes:episode>
      <podcast:episode>628</podcast:episode>
      <itunes:title>Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26f9304f-5162-4614-bd89-96ada9220262</guid>
      <link>https://share.transistor.fm/s/8f3ee121</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling</p>

            <p><strong>Title:</strong><br>
            Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01774v1">http://arxiv.org/abs/2503.01774v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling</p>

            <p><strong>Title:</strong><br>
            Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01774v1">http://arxiv.org/abs/2503.01774v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Mar 2025 19:13:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8f3ee121/3b1a9b3a.mp3" length="18368562" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1144</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling</p>

            <p><strong>Title:</strong><br>
            Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2503.01774v1">http://arxiv.org/abs/2503.01774v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis tasks. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking</title>
      <itunes:episode>627</itunes:episode>
      <podcast:episode>627</podcast:episode>
      <itunes:title>DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3a3a1b29-ba76-4faf-9a56-add5080421a8</guid>
      <link>https://share.transistor.fm/s/356d5087</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun</p>

            <p><strong>Title:</strong><br>
            DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20730v1">http://arxiv.org/abs/2502.20730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system's ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages the tree-based exploration and bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on the SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun</p>

            <p><strong>Title:</strong><br>
            DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20730v1">http://arxiv.org/abs/2502.20730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system's ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages tree-based exploration and a bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Mar 2025 19:18:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/356d5087/e080e0e4.mp3" length="21957613" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun</p>

            <p><strong>Title:</strong><br>
            DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20730v1">http://arxiv.org/abs/2502.20730v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system's ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages tree-based exploration and a bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chain of Draft: Thinking Faster by Writing Less</title>
      <itunes:episode>626</itunes:episode>
      <podcast:episode>626</podcast:episode>
      <itunes:title>Chain of Draft: Thinking Faster by Writing Less</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ee18d3d-ddfb-427f-9199-d30f0308f4e4</guid>
      <link>https://share.transistor.fm/s/da7802d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He</p>

            <p><strong>Title:</strong><br>
            Chain of Draft: Thinking Faster by Writing Less</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18600v1">http://arxiv.org/abs/2502.18600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He</p>

            <p><strong>Title:</strong><br>
            Chain of Draft: Thinking Faster by Writing Less</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18600v1">http://arxiv.org/abs/2502.18600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.</p>
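
            <p><em>Illustrative sketch (not the paper's prompts or code):</em> a minimal way to contrast a verbose Chain-of-Thought instruction with a Chain-of-Draft-style instruction that keeps every intermediate step to a short draft. The instructions, the example outputs, and the crude whitespace token count below are assumptions made for illustration only.</p>

            <pre><code># Hypothetical sketch: CoT vs. Chain-of-Draft style instructions.
COT_INSTRUCTION = (
    "Think step by step and explain each step in full sentences, "
    "then give the final answer after '####'."
)
COD_INSTRUCTION = (
    "Think step by step, but write each step as a minimal draft of a "
    "few words at most, then give the final answer after '####'."
)

def build_prompt(instruction, question):
    """Assemble a single-turn prompt; plug into any chat client you use."""
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

def rough_tokens(text):
    """Crude whitespace token count, only to compare verbosity."""
    return len(text.split())

if __name__ == "__main__":
    q = "A jar holds 3 red and 5 blue marbles. How many marbles in total?"
    print(build_prompt(COD_INSTRUCTION, q))
    # Illustrative outputs (not generated here) showing the verbosity gap.
    cot_out = ("There are 3 red marbles and 5 blue marbles. "
               "Adding them gives 3 + 5 = 8 marbles in total. #### 8")
    cod_out = "3 red; 5 blue; 3 + 5 = 8. #### 8"
    print(rough_tokens(cot_out), "vs", rough_tokens(cod_out), "tokens")
</code></pre>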
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Mar 2025 19:18:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/da7802d6/867dd2be.mp3" length="21767799" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1357</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He</p>

            <p><strong>Title:</strong><br>
            Chain of Draft: Thinking Faster by Writing Less</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18600v1">http://arxiv.org/abs/2502.18600v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multi-Turn Code Generation Through Single-Step Rewards</title>
      <itunes:episode>625</itunes:episode>
      <podcast:episode>625</podcast:episode>
      <itunes:title>Multi-Turn Code Generation Through Single-Step Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">219b2bf7-241d-4e9e-a284-70e4d5199410</guid>
      <link>https://share.transistor.fm/s/85e45098</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury</p>

            <p><strong>Title:</strong><br>
            Multi-Turn Code Generation Through Single-Step Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20380v1">http://arxiv.org/abs/2502.20380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury</p>

            <p><strong>Title:</strong><br>
            Multi-Turn Code Generation Through Single-Step Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20380v1">http://arxiv.org/abs/2502.20380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.</p>
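
            <p><em>Illustrative sketch (not the authors' implementation):</em> the shape of a generator-verifier repair loop like the one the abstract describes. Each turn, a generator proposes candidate programs conditioned on the latest execution feedback, a verifier scores them, and only the best candidate is executed to produce the next feedback. The toy generator, verifier, and test harness below are stand-ins, not the paper's API.</p>

            <pre><code># Hypothetical sketch of a multi-turn repair loop with single-step candidate selection.
from typing import Callable, List, Tuple

def run_tests(code: str, tests: List[Tuple[int, int]]) -> str:
    """Execute a candidate that must define solve(x); return textual feedback."""
    env: dict = {}
    try:
        exec(code, env)
        fails = [f"solve({x}) gave {env['solve'](x)}, expected {y}"
                 for x, y in tests if env["solve"](x) != y]
        return "all tests passed" if not fails else "; ".join(fails)
    except Exception as exc:
        return f"error: {exc}"

def repair_loop(generate: Callable[[str], List[str]],
                score: Callable[[str], float],
                tests: List[Tuple[int, int]],
                turns: int = 3) -> str:
    """Keep the best-scoring candidate each turn and feed its feedback back in."""
    feedback, best = "no attempt yet", ""
    for _ in range(turns):
        candidates = generate(feedback)          # generator conditioned on feedback
        best = max(candidates, key=score)        # verifier picks one candidate
        feedback = run_tests(best, tests)        # execution produces new feedback
        if feedback == "all tests passed":
            break
    return best

if __name__ == "__main__":
    tests = [(2, 4), (3, 6)]
    toy_generate = lambda fb: ["def solve(x):\n    return x + x",
                               "def solve(x):\n    return x * 3"]
    toy_score = lambda c: 1.0 if run_tests(c, tests) == "all tests passed" else 0.0
    print(repair_loop(toy_generate, toy_score, tests))
</code></pre>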
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Mar 2025 19:17:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/85e45098/42af8a0a.mp3" length="24589867" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1533</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury</p>

            <p><strong>Title:</strong><br>
            Multi-Turn Code Generation Through Single-Step Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20380v1">http://arxiv.org/abs/2502.20380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Self-rewarding correction for mathematical reasoning</title>
      <itunes:episode>624</itunes:episode>
      <podcast:episode>624</podcast:episode>
      <itunes:title>Self-rewarding correction for mathematical reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cf33236d-8b7a-4711-acbc-383cbf076d29</guid>
      <link>https://share.transistor.fm/s/aff8281f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            Self-rewarding correction for mathematical reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19613v1">http://arxiv.org/abs/2502.19613v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            Self-rewarding correction for mathematical reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19613v1">http://arxiv.org/abs/2502.19613v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.</p>
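
            <p><em>Illustrative sketch (not the paper's interface):</em> the inference-time behavior the abstract describes, where one model generates an answer, judges its own output, revises when it judges the answer wrong, and decides when to stop. The callables generate, self_evaluate, and revise are hypothetical placeholders for calls to the same fine-tuned model.</p>

            <pre><code># Hypothetical sketch of a self-rewarding generate / self-check / revise loop.
from typing import Callable

def self_rewarding_loop(question: str,
                        generate: Callable[[str], str],
                        self_evaluate: Callable[[str, str], bool],
                        revise: Callable[[str, str], str],
                        max_rounds: int = 3) -> str:
    answer = generate(question)
    for _ in range(max_rounds):
        if self_evaluate(question, answer):   # model scores its own output
            break                             # model decides to terminate
        answer = revise(question, answer)     # model corrects itself
    return answer

if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end.
    attempts = iter(["41", "42"])
    result = self_rewarding_loop(
        "What is 6 * 7?",
        generate=lambda q: next(attempts),
        self_evaluate=lambda q, a: a == "42",
        revise=lambda q, a: next(attempts),
    )
    print(result)
</code></pre>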
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:04:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aff8281f/c1aad104.mp3" length="23575896" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1470</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            Self-rewarding correction for mathematical reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19613v1">http://arxiv.org/abs/2502.19613v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning</title>
      <itunes:episode>623</itunes:episode>
      <podcast:episode>623</podcast:episode>
      <itunes:title>MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">712f33b6-7f98-4453-9754-49b9c0cf58cd</guid>
      <link>https://share.transistor.fm/s/80ce1b3f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert</p>

            <p><strong>Title:</strong><br>
            MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19634v1">http://arxiv.org/abs/2502.19634v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert</p>

            <p><strong>Title:</strong><br>
            MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19634v1">http://arxiv.org/abs/2502.19634v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:04:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/80ce1b3f/6384fde3.mp3" length="22438272" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert</p>

            <p><strong>Title:</strong><br>
            MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19634v1">http://arxiv.org/abs/2502.19634v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts</title>
      <itunes:episode>622</itunes:episode>
      <podcast:episode>622</podcast:episode>
      <itunes:title>R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4150f58b-d677-4535-823e-e2020c08f858</guid>
      <link>https://share.transistor.fm/s/8afc92b2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20395v1">http://arxiv.org/abs/2502.20395v1</a></p>

            <p><strong>Abstract:</strong><br>
            In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), limiting LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20395v1">http://arxiv.org/abs/2502.20395v1</a></p>

            <p><strong>Abstract:</strong><br>
            In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), limiting LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.</p>
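
            <p><em>Illustrative sketch (not the authors' code):</em> one simple flavor of test-time re-routing, where a test sample's routing-weight vector is nudged toward a similarity-weighted average of the routing weights of nearby reference samples that were predicted correctly. The Gaussian kernel, step size, and array shapes are assumptions for illustration; the paper proposes several strategies beyond this.</p>

            <pre><code># Hypothetical sketch: move routing weights toward correct neighbors' weights.
import numpy as np

def rerouted_weights(init_w, test_feat, ref_feats, ref_weights,
                     step=0.5, bandwidth=1.0):
    """init_w: (k,) router output for the test sample; ref_feats: (n, d) features
    of correctly predicted reference samples; ref_weights: (n, k) their routing weights."""
    d2 = np.sum((ref_feats - test_feat) ** 2, axis=1)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))   # closer neighbors count more
    kernel = kernel / kernel.sum()
    target = kernel @ ref_weights                   # neighborhood consensus routing
    new_w = (1.0 - step) * init_w + step * target   # one local update step
    return new_w / new_w.sum()                      # keep a valid mixture over experts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_feats = rng.normal(size=(5, 8))             # 5 neighbors, 8-dim features
    ref_weights = rng.dirichlet(np.ones(4), size=5) # routing over 4 experts
    init_w = np.full(4, 0.25)
    print(rerouted_weights(init_w, ref_feats[0], ref_feats, ref_weights))
</code></pre>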
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:03:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8afc92b2/691fa5ef.mp3" length="21586004" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1345</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhongyang Li, Ziyue Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20395v1">http://arxiv.org/abs/2502.20395v1</a></p>

            <p><strong>Abstract:</strong><br>
            In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), limiting LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights at test time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongRoPE2: Near-Lossless LLM Context Window Scaling</title>
      <itunes:episode>621</itunes:episode>
      <podcast:episode>621</podcast:episode>
      <itunes:title>LongRoPE2: Near-Lossless LLM Context Window Scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a800461c-c2a2-4348-b9ea-e7803f1f9805</guid>
      <link>https://share.transistor.fm/s/be82e5b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LongRoPE2: Near-Lossless LLM Context Window Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20082v1">http://arxiv.org/abs/2502.20082v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LongRoPE2: Near-Lossless LLM Context Window Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20082v1">http://arxiv.org/abs/2502.20082v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.</p>
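
            <p><em>Illustrative sketch (not the released code):</em> the RoPE-rescaling idea in miniature. Each rotary frequency is divided by a per-dimension factor so that positions far beyond the original window rotate no further than positions inside it did before rescaling; in LongRoPE2 those factors come from an evolutionary search, whereas the linear ramp below is only a placeholder assumption.</p>

            <pre><code># Hypothetical sketch: per-dimension rescaling of RoPE frequencies.
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for an even head dimension."""
    return base ** (-np.arange(0, dim, 2) / dim)

def rescaled_inv_freq(dim: int, scale: float) -> np.ndarray:
    """Divide each frequency by a factor between 1 and `scale`
    (low dimensions barely touched, high dimensions stretched the most)."""
    inv_freq = rope_inv_freq(dim)
    factors = np.linspace(1.0, scale, inv_freq.shape[0])  # placeholder for searched factors
    return inv_freq / factors

if __name__ == "__main__":
    dim, old_ctx, new_ctx = 64, 8192, 131072
    angle_old = old_ctx * rope_inv_freq(dim)[-1]
    angle_new = new_ctx * rescaled_inv_freq(dim, scale=new_ctx / old_ctx)[-1]
    # The slowest dimension rotates the same amount at 128K as it did at 8K.
    print(round(angle_old, 4), round(angle_new, 4))
</code></pre>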
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:03:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/be82e5b3/d7094cbb.mp3" length="22219199" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1385</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang</p>

            <p><strong>Title:</strong><br>
            LongRoPE2: Near-Lossless LLM Context Window Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20082v1">http://arxiv.org/abs/2502.20082v1</a></p>

            <p><strong>Abstract:</strong><br>
            LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving</title>
      <itunes:episode>620</itunes:episode>
      <podcast:episode>620</podcast:episode>
      <itunes:title>FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85065279-b9df-4ebb-b7b6-1f991cf2d610</guid>
      <link>https://share.transistor.fm/s/d6416b26</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong</p>

            <p><strong>Title:</strong><br>
            FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20238v1">http://arxiv.org/abs/2502.20238v1</a></p>

            <p><strong>Abstract:</strong><br>
            Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks rely heavily on final-answer accuracy, leaving a model's intermediate reasoning steps largely unexamined. This fails to assess the model's ability to reflect on and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains of up to 5.1% in math reasoning on GSM8K.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong</p>

            <p><strong>Title:</strong><br>
            FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20238v1">http://arxiv.org/abs/2502.20238v1</a></p>

            <p><strong>Abstract:</strong><br>
            Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks rely heavily on final-answer accuracy, leaving a model's intermediate reasoning steps largely unexamined. This fails to assess the model's ability to reflect on and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains of up to 5.1% in math reasoning on GSM8K.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:03:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6416b26/1edb4320.mp3" length="25458847" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1587</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong</p>

            <p><strong>Title:</strong><br>
            FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20238v1">http://arxiv.org/abs/2502.20238v1</a></p>

            <p><strong>Abstract:</strong><br>
            Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks rely heavily on final-answer accuracy, leaving a model's intermediate reasoning steps largely unexamined. This fails to assess the model's ability to reflect on and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains of up to 5.1% in math reasoning on GSM8K.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale</title>
      <itunes:episode>619</itunes:episode>
      <podcast:episode>619</podcast:episode>
      <itunes:title>CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">77659fd3-2feb-43fa-a1d4-9ba6ba90990f</guid>
      <link>https://share.transistor.fm/s/dd529079</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16645v1">http://arxiv.org/abs/2502.16645v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16645v1">http://arxiv.org/abs/2502.16645v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:02:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dd529079/c0fba365.mp3" length="21053960" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen</p>

            <p><strong>Title:</strong><br>
            CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16645v1">http://arxiv.org/abs/2502.16645v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniTok: A Unified Tokenizer for Visual Generation and Understanding</title>
      <itunes:episode>618</itunes:episode>
      <podcast:episode>618</podcast:episode>
      <itunes:title>UniTok: A Unified Tokenizer for Visual Generation and Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2c5e081-4c71-4f48-8df1-3419fa0b3916</guid>
      <link>https://share.transistor.fm/s/729f4866</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            UniTok: A Unified Tokenizer for Visual Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20321v1">http://arxiv.org/abs/2502.20321v1</a></p>

            <p><strong>Abstract:</strong><br>
            The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which splits vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            UniTok: A Unified Tokenizer for Visual Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20321v1">http://arxiv.org/abs/2502.20321v1</a></p>

            <p><strong>Abstract:</strong><br>
            The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:02:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/729f4866/34d75611.mp3" length="23788234" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1483</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            UniTok: A Unified Tokenizer for Visual Generation and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20321v1">http://arxiv.org/abs/2502.20321v1</a></p>

            <p><strong>Abstract:</strong><br>
            The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NeoBERT: A Next-Generation BERT</title>
      <itunes:episode>617</itunes:episode>
      <podcast:episode>617</podcast:episode>
      <itunes:title>NeoBERT: A Next-Generation BERT</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bbf1edcc-6dd4-4b14-adc6-54747b1f0319</guid>
      <link>https://share.transistor.fm/s/8d353c46</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar</p>

            <p><strong>Title:</strong><br>
            NeoBERT: A Next-Generation BERT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19587v1">http://arxiv.org/abs/2502.19587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar</p>

            <p><strong>Title:</strong><br>
            NeoBERT: A Next-Generation BERT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19587v1">http://arxiv.org/abs/2502.19587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:01:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8d353c46/bf112c9b.mp3" length="22772975" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar</p>

            <p><strong>Title:</strong><br>
            NeoBERT: A Next-Generation BERT</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19587v1">http://arxiv.org/abs/2502.19587v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance</title>
      <itunes:episode>616</itunes:episode>
      <podcast:episode>616</podcast:episode>
      <itunes:title>Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b3209772-e2c4-4bc0-8104-cd0076ef6ed9</guid>
      <link>https://share.transistor.fm/s/ea9fcf65</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16944v1">http://arxiv.org/abs/2502.16944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose <strong>Decoupled Value Policy Optimization (DVPO)</strong>, a lean framework that replaces traditional reward modeling with a pretrained <em>global value model (GVM)</em>. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16944v1">http://arxiv.org/abs/2502.16944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose <strong>Decoupled Value Policy Optimization (DVPO)</strong>, a lean framework that replaces traditional reward modeling with a pretrained <em>global value model (GVM)</em>. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:01:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ea9fcf65/228a9970.mp3" length="20701197" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.16944v1">http://arxiv.org/abs/2502.16944v1</a></p>

            <p><strong>Abstract:</strong><br>
            Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose <strong>Decoupled Value Policy Optimization (DVPO)</strong>, a lean framework that replaces traditional reward modeling with a pretrained <em>global value model (GVM)</em>. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling the value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40% and training time by 35% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think</title>
      <itunes:episode>615</itunes:episode>
      <podcast:episode>615</podcast:episode>
      <itunes:title>Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6cd271b5-322f-495c-8659-fd90a0421c1c</guid>
      <link>https://share.transistor.fm/s/148d8388</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20172v1">http://arxiv.org/abs/2502.20172v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20172v1">http://arxiv.org/abs/2502.20172v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 28 Feb 2025 21:01:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/148d8388/aa879c56.mp3" length="21427229" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1336</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.20172v1">http://arxiv.org/abs/2502.20172v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, such as Canny edge maps and depth maps, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GHOST 2.0: generative high-fidelity one shot transfer of heads</title>
      <itunes:episode>614</itunes:episode>
      <podcast:episode>614</podcast:episode>
      <itunes:title>GHOST 2.0: generative high-fidelity one shot transfer of heads</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">28506a65-73c9-416a-aa6f-79bdccb8bf4b</guid>
      <link>https://share.transistor.fm/s/46f7fc29</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            GHOST 2.0: generative high-fidelity one shot transfer of heads</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18417v3">http://arxiv.org/abs/2502.18417v3</a></p>

            <p><strong>Abstract:</strong><br>
            While the task of face swapping has recently gained attention in the research community, the related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swapping poses extra challenges, such as the need to preserve the structural information of the whole head during synthesis and to inpaint gaps between the swapped head and the background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce an enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Second, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on their corresponding tasks, allowing us to achieve state-of-the-art results in head swapping. We also tackle complex cases, such as large differences in hairstyle between the source and the target. Code is available at https://github.com/ai-forever/ghost-2.0</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            GHOST 2.0: generative high-fidelity one shot transfer of heads</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18417v3">http://arxiv.org/abs/2502.18417v3</a></p>

            <p><strong>Abstract:</strong><br>
            While the task of face swapping has recently gained attention in the research community, the related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swapping poses extra challenges, such as the need to preserve the structural information of the whole head during synthesis and to inpaint gaps between the swapped head and the background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce an enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Second, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on their corresponding tasks, allowing us to achieve state-of-the-art results in head swapping. We also tackle complex cases, such as large differences in hairstyle between the source and the target. Code is available at https://github.com/ai-forever/ghost-2.0</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:48:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46f7fc29/a3024c45.mp3" length="18001166" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1121</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 49 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov</p>

            <p><strong>Title:</strong><br>
            GHOST 2.0: generative high-fidelity one shot transfer of heads</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18417v3">http://arxiv.org/abs/2502.18417v3</a></p>

            <p><strong>Abstract:</strong><br>
            While the task of face swapping has recently gained attention in the research community, the related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swapping poses extra challenges, such as the need to preserve the structural information of the whole head during synthesis and to inpaint gaps between the swapped head and the background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce an enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Second, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on their corresponding tasks, allowing us to achieve state-of-the-art results in head swapping. We also tackle complex cases, such as large differences in hairstyle between the source and the target. Code is available at https://github.com/ai-forever/ghost-2.0</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kanana: Compute-efficient Bilingual Language Models</title>
      <itunes:episode>613</itunes:episode>
      <podcast:episode>613</podcast:episode>
      <itunes:title>Kanana: Compute-efficient Bilingual Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a45d794-f95f-4792-8ef8-9d5d1b209730</guid>
      <link>https://share.transistor.fm/s/6343a626</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo</p>

            <p><strong>Title:</strong><br>
            Kanana: Compute-efficient Bilingual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18934v2">http://arxiv.org/abs/2502.18934v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kanana, a series of bilingual language models that demonstrate exceptional performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo</p>

            <p><strong>Title:</strong><br>
            Kanana: Compute-efficient Bilingual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18934v2">http://arxiv.org/abs/2502.18934v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kanana, a series of bilingual language models that demonstrate exceptional performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:47:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6343a626/ef02fd4f.mp3" length="21265416" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1325</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kanana LLM Team, Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo</p>

            <p><strong>Title:</strong><br>
            Kanana: Compute-efficient Bilingual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18934v2">http://arxiv.org/abs/2502.18934v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Kanana, a series of bilingual language models that demonstrate exceptional performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high-quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters, with the 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding</title>
      <itunes:episode>612</itunes:episode>
      <podcast:episode>612</podcast:episode>
      <itunes:title>TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d51db910-42a8-44ad-a188-1cc49ad1e3ab</guid>
      <link>https://share.transistor.fm/s/ba213c86</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19400v1">http://arxiv.org/abs/2502.19400v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19400v1">http://arxiv.org/abs/2502.19400v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:47:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ba213c86/509b3d23.mp3" length="21525836" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.AI, cs.CL, cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19400v1">http://arxiv.org/abs/2502.19400v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance</title>
      <itunes:episode>611</itunes:episode>
      <podcast:episode>611</podcast:episode>
      <itunes:title>Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ff22685-2a12-4826-9751-2ea7aed4d564</guid>
      <link>https://share.transistor.fm/s/2ff6176f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18772v1">http://arxiv.org/abs/2502.18772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored in the Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18772v1">http://arxiv.org/abs/2502.18772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored in the Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:47:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ff6176f/ee29c3d9.mp3" length="24012683" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1497</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou</p>

            <p><strong>Title:</strong><br>
            Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18772v1">http://arxiv.org/abs/2502.18772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored in the Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Language Models' Factuality Depends on the Language of Inquiry</title>
      <itunes:episode>610</itunes:episode>
      <podcast:episode>610</podcast:episode>
      <itunes:title>Language Models' Factuality Depends on the Language of Inquiry</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d79f101-f891-4a9e-a400-d8ad0b869eba</guid>
      <link>https://share.transistor.fm/s/821444f0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            Language Models' Factuality Depends on the Language of Inquiry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.17955v1">http://arxiv.org/abs/2502.17955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics, Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score, to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today's state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            Language Models' Factuality Depends on the Language of Inquiry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.17955v1">http://arxiv.org/abs/2502.17955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics, Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score, to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today's state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:46:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/821444f0/8b31af2c.mp3" length="21575135" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1345</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang</p>

            <p><strong>Title:</strong><br>
            Language Models' Factuality Depends on the Language of Inquiry</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.17955v1">http://arxiv.org/abs/2502.17955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when asked in English or Swahili. To systematically investigate this limitation, we introduce a benchmark of 10,000 country-related facts across 13 languages and propose three novel metrics, Factual Recall Score, Knowledge Transferability Score, and Cross-Lingual Factual Knowledge Transferability Score, to quantify factual recall and knowledge transferability in LMs across different languages. Our results reveal fundamental weaknesses in today's state-of-the-art LMs, particularly in cross-lingual generalization where models fail to transfer knowledge effectively across different languages, leading to inconsistent performance sensitive to the language used. Our findings emphasize the need for LMs to recognize language-specific factual reliability and leverage the most trustworthy information across languages. We release our benchmark and evaluation framework to drive future research in multilingual knowledge transfer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?</title>
      <itunes:episode>609</itunes:episode>
      <podcast:episode>609</podcast:episode>
      <itunes:title>Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1acf19eb-ac6c-4848-9234-8029642534fc</guid>
      <link>https://share.transistor.fm/s/db8c8a24</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19361v2">http://arxiv.org/abs/2502.19361v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.</p>
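
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal sketch of how a critic model's error flags might be scored against human-annotated error sections for one long chain of thought. The data layout and the precision/recall/F1 choice are assumptions for illustration, not DeltaBench's actual format or metrics.</p>

            <pre><code>
# Hedged sketch: compare a critic's flagged sections against annotated errors.
annotated_errors = {3, 7}     # section indices annotators marked as erroneous
critic_flags = {3, 5}         # section indices the critic model flagged

true_pos = len(annotated_errors.intersection(critic_flags))
precision = true_pos / len(critic_flags) if critic_flags else 0.0
recall = true_pos / len(annotated_errors) if annotated_errors else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
</code></pre>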
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19361v2">http://arxiv.org/abs/2502.19361v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:46:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/db8c8a24/bdc4b717.mp3" length="23163811" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1444</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19361v2">http://arxiv.org/abs/2502.19361v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards an AI co-scientist</title>
      <itunes:episode>608</itunes:episode>
      <podcast:episode>608</podcast:episode>
      <itunes:title>Towards an AI co-scientist</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1503d5ee-644b-4a78-b7ff-a7b2581594d9</guid>
      <link>https://share.transistor.fm/s/01f1ac62</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.HC, cs.LG, physics.soc-ph, q-bio.OT</p>

            <p><strong>Authors:</strong><br>
            Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan</p>

            <p><strong>Title:</strong><br>
            Towards an AI co-scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18864v1">http://arxiv.org/abs/2502.18864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists.</p>
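
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy illustration of tournament-style ranking over candidate hypotheses, loosely echoing the "tournament evolution" idea above. The pairwise judge is a placeholder length heuristic and the Elo update is generic; neither reflects the actual Gemini 2.0 based multi-agent system.</p>

            <pre><code>
# Toy tournament: pairwise comparisons update Elo-style ratings of hypotheses.
import itertools, random

hypotheses = ["hypothesis A", "longer, more detailed hypothesis B", "hypothesis C"]
rating = {h: 1000.0 for h in hypotheses}

def judge(a, b):
    # Placeholder "debate": prefer the longer hypothesis, break ties randomly.
    if len(a) == len(b):
        return random.choice([a, b])
    return max([a, b], key=len)

for a, b in itertools.combinations(hypotheses, 2):
    winner = judge(a, b)
    loser = b if winner == a else a
    expected = 1.0 / (1.0 + 10 ** ((rating[winner] - rating[loser]) / -400))
    rating[winner] += 32 * (1 - expected)
    rating[loser] -= 32 * (1 - expected)

print(sorted(rating.items(), key=lambda kv: -kv[1]))
</code></pre>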
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.HC, cs.LG, physics.soc-ph, q-bio.OT</p>

            <p><strong>Authors:</strong><br>
            Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan</p>

            <p><strong>Title:</strong><br>
            Towards an AI co-scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18864v1">http://arxiv.org/abs/2502.18864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:45:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/01f1ac62/eb7d884d.mp3" length="24108350" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1503</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.HC, cs.LG, physics.soc-ph, q-bio.OT</p>

            <p><strong>Authors:</strong><br>
            Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan</p>

            <p><strong>Title:</strong><br>
            Towards an AI co-scientist</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18864v1">http://arxiv.org/abs/2502.18864v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher in an era of AI-empowered scientists.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems</title>
      <itunes:episode>607</itunes:episode>
      <podcast:episode>607</podcast:episode>
      <itunes:title>Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bcfbace7-b43e-45a2-8b46-1c8c9a6de84a</guid>
      <link>https://share.transistor.fm/s/b03b3532</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19328v1">http://arxiv.org/abs/2502.19328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).</p>
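
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A hedged sketch of the reward-agent idea: combine a preference reward with verifiable factuality and instruction-following checks, then use the combined score for best-of-n selection. Every scoring function and weight below is a placeholder; the actual implementation is in the repository linked above.</p>

            <pre><code>
# Placeholder signals standing in for a preference RM and two verifiers.
def preference_score(prompt, response):
    return 0.5  # stand-in for a learned preference reward model

def factuality_check(prompt, response):
    return 1.0  # stand-in for a retrieval-backed factuality verifier

def instruction_check(prompt, response):
    return float("step by step" in response)  # toy constraint check

def agentic_reward(prompt, response, weights=(0.4, 0.3, 0.3)):
    signals = (
        preference_score(prompt, response),
        factuality_check(prompt, response),
        instruction_check(prompt, response),
    )
    return sum(w * s for w, s in zip(weights, signals))

def best_of_n(prompt, candidates):
    return max(candidates, key=lambda r: agentic_reward(prompt, r))

print(best_of_n("Explain DPO step by step.",
                ["DPO is a loss.", "Here it is step by step: ..."]))
</code></pre>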
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19328v1">http://arxiv.org/abs/2502.19328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:45:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b03b3532/cfc2a485.mp3" length="17672706" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1101</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19328v1">http://arxiv.org/abs/2502.19328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation</title>
      <itunes:episode>606</itunes:episode>
      <podcast:episode>606</podcast:episode>
      <itunes:title>Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">390a402c-281d-420d-a9e9-1760ec7abb31</guid>
      <link>https://share.transistor.fm/s/31308d53</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu</p>

            <p><strong>Title:</strong><br>
            Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19414v1">http://arxiv.org/abs/2502.19414v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only &lt;9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.</p>
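
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal, self-contained example of the code-execution check that makes counterexamples automatically verifiable: run a reference solution and an incorrect submission on a candidate input and test whether their outputs disagree. The two toy programs and the harness details are assumptions, not REFUTE's actual setup.</p>

            <pre><code>
# Hedged sketch of counterexample validation via code execution.
import subprocess, sys

CORRECT = "print(sum(map(int, input().split())))"    # reference solution
BUGGY = "print(max(map(int, input().split())))"      # subtly wrong submission

def run(source, stdin_text):
    proc = subprocess.run([sys.executable, "-c", source],
                          input=stdin_text, capture_output=True,
                          text=True, timeout=5)
    return proc.stdout.strip()

def is_counterexample(stdin_text):
    # The input refutes the buggy solution if the two outputs disagree.
    return run(CORRECT, stdin_text) != run(BUGGY, stdin_text)

print(is_counterexample("2 3\n"))   # True: outputs 5 vs 3
print(is_counterexample("4\n"))     # False: both output 4
</code></pre>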
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu</p>

            <p><strong>Title:</strong><br>
            Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19414v1">http://arxiv.org/abs/2502.19414v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only &lt;9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:45:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/31308d53/28545fc1.mp3" length="21962611" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu</p>

            <p><strong>Title:</strong><br>
            Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.19414v1">http://arxiv.org/abs/2502.19414v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is growing excitement about the potential of Language Models (LMs) to accelerate scientific discovery. Falsifying hypotheses is key to scientific progress, as it allows claims to be iteratively refined over time. This process requires significant researcher effort, reasoning, and ingenuity. Yet current benchmarks for LMs predominantly assess their ability to generate solutions rather than challenge them. We advocate for developing benchmarks that evaluate this inverse capability - creating counterexamples for subtly incorrect solutions. To demonstrate this approach, we start with the domain of algorithmic problem solving, where counterexamples can be evaluated automatically using code execution. Specifically, we introduce REFUTE, a dynamically updating benchmark that includes recent problems and incorrect submissions from programming competitions, where human experts successfully identified counterexamples. Our analysis finds that the best reasoning agents, even OpenAI o3-mini (high) with code execution feedback, can create counterexamples for only &lt;9% of incorrect solutions in REFUTE, even though ratings indicate its ability to solve up to 48% of these problems from scratch. We hope our work spurs progress in evaluating and enhancing LMs' ability to falsify incorrect solutions - a capability that is crucial for both accelerating research and making models self-improve through reliable reflective reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rank1: Test-Time Compute for Reranking in Information Retrieval</title>
      <itunes:episode>605</itunes:episode>
      <podcast:episode>605</podcast:episode>
      <itunes:title>Rank1: Test-Time Compute for Reranking in Information Retrieval</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">470001bf-e294-4223-9c3f-39fdca386f3f</guid>
      <link>https://share.transistor.fm/s/b740217a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Rank1: Test-Time Compute for Reranking in Information Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18418v1">http://arxiv.org/abs/2502.18418v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.</p>
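
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy sketch of pointwise reranking: each query-passage pair receives a relevance score and passages are sorted by it. The keyword-overlap scorer below merely stands in for the distilled reasoning model; Rank1's prompting, reasoning traces, and scoring are not reproduced here.</p>

            <pre><code>
# Hedged sketch: score each passage, then sort. The scorer is a stand-in.
def reasoning_relevance(query, passage):
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms.intersection(p_terms)) / max(len(q_terms), 1)

def rerank(query, passages):
    scored = [(reasoning_relevance(query, p), p) for p in passages]
    return [p for _, p in sorted(scored, reverse=True)]

passages = ["test-time compute for reranking", "a recipe for sourdough bread"]
print(rerank("test-time compute reranking", passages))
</code></pre>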
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Rank1: Test-Time Compute for Reranking in Information Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18418v1">http://arxiv.org/abs/2502.18418v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 27 Feb 2025 20:44:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b740217a/490a6438.mp3" length="19120880" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1191</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Rank1: Test-Time Compute for Reranking in Information Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.18418v1">http://arxiv.org/abs/2502.18418v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MLGym: A New Framework and Benchmark for Advancing AI Research Agents</title>
      <itunes:episode>604</itunes:episode>
      <podcast:episode>604</podcast:episode>
      <itunes:title>MLGym: A New Framework and Benchmark for Advancing AI Research Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2b8c539-2c70-4f95-a893-7adc15c19108</guid>
      <link>https://share.transistor.fm/s/b5bd29b1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 122 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu</p>

            <p><strong>Title:</strong><br>
            MLGym: A New Framework and Benchmark for Advancing AI Research Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14499v1">http://arxiv.org/abs/2502.14499v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 122 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu</p>

            <p><strong>Title:</strong><br>
            MLGym: A New Framework and Benchmark for Advancing AI Research Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14499v1">http://arxiv.org/abs/2502.14499v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:49:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b5bd29b1/a0725808.mp3" length="25434159" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1586</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 122 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu</p>

            <p><strong>Title:</strong><br>
            MLGym: A New Framework and Benchmark for Advancing AI Research Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14499v1">http://arxiv.org/abs/2502.14499v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-bench consists of 13 diverse and open-ended AI research tasks from diverse domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, as well as develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</title>
      <itunes:episode>603</itunes:episode>
      <podcast:episode>603</podcast:episode>
      <itunes:title>SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">763eb78f-8d98-4790-a624-8f199919f370</guid>
      <link>https://share.transistor.fm/s/aeba5206</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14786v1">http://arxiv.org/abs/2502.14786v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).</p>
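
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            SigLIP 2 extends the original SigLIP recipe, whose core image-text objective is a pairwise sigmoid loss. The NumPy sketch below shows only that base loss; the added captioning, self-distillation, and masked-prediction terms are omitted, and the temperature, bias, and random features are illustrative values.</p>

            <pre><code>
# Pairwise sigmoid image-text loss (original SigLIP form), for illustration.
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * img_emb @ txt_emb.T + bias   # pairwise similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0           # +1 on diagonal, -1 off
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably via logaddexp
    log_sig = -np.logaddexp(0.0, -labels * logits)
    return -log_sig.sum() / len(img_emb)

rng = np.random.default_rng(0)
print(siglip_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
</code></pre>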
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14786v1">http://arxiv.org/abs/2502.14786v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:49:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aeba5206/d796940c.mp3" length="24430688" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1523</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 82 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14786v1">http://arxiv.org/abs/2502.14786v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines</title>
      <itunes:episode>602</itunes:episode>
      <podcast:episode>602</podcast:episode>
      <itunes:title>SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ff9bdc2-d0e7-4ff3-94ad-63f75b624167</guid>
      <link>https://share.transistor.fm/s/7bc2a39c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14739v1">http://arxiv.org/abs/2502.14739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields (particularly in light industry, agriculture, and service-oriented disciplines) remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14739v1">http://arxiv.org/abs/2502.14739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields (particularly in light industry, agriculture, and service-oriented disciplines) remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:48:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7bc2a39c/f99ffaf2.mp3" length="23199328" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 81 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang</p>

            <p><strong>Title:</strong><br>
            SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14739v1">http://arxiv.org/abs/2502.14739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?</title>
      <itunes:episode>601</itunes:episode>
      <podcast:episode>601</podcast:episode>
      <itunes:title>How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8dd4ded3-1aaf-4375-a831-199531f93cdd</guid>
      <link>https://share.transistor.fm/s/68472a73</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov</p>

            <p><strong>Title:</strong><br>
            How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14502v1">http://arxiv.org/abs/2502.14502v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov</p>

            <p><strong>Title:</strong><br>
            How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14502v1">http://arxiv.org/abs/2502.14502v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:48:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/68472a73/9645572e.mp3" length="21869388" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1363</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sergey Pletenev, Maria Marina, Daniil Moskovskiy, Vasily Konovalov, Pavel Braslavski, Alexander Panchenko, Mikhail Salnikov</p>

            <p><strong>Title:</strong><br>
            How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14502v1">http://arxiv.org/abs/2502.14502v1</a></p>

            <p><strong>Abstract:</strong><br>
            The performance of Large Language Models (LLMs) on many tasks is greatly limited by the knowledge learned during pre-training and stored in the model's parameters. Low-rank adaptation (LoRA) is a popular and efficient training technique for updating LLMs or adapting them to specific domains. In this study, we investigate how new facts can be incorporated into the LLM using LoRA without compromising the previously learned knowledge. We fine-tuned Llama-3.1-8B-instruct using LoRA with varying amounts of new knowledge. Our experiments have shown that the best results are obtained when the training data contains a mixture of known and new facts. However, this approach is still potentially harmful because the model's performance on external question-answering benchmarks declines after such fine-tuning. When the training data is biased towards certain entities, the model tends to regress to a few overrepresented answers. In addition, we found that the model becomes more confident and refuses to provide an answer in only a few cases. These findings highlight the potential pitfalls of LoRA-based LLM updates and underscore the importance of training data composition and tuning parameters to balance new knowledge integration and general model capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>S*: Test Time Scaling for Code Generation</title>
      <itunes:episode>600</itunes:episode>
      <podcast:episode>600</podcast:episode>
      <itunes:title>S*: Test Time Scaling for Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0d520eb6-a251-486d-bd59-ad55ed94fa24</guid>
      <link>https://share.transistor.fm/s/46112506</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            S*: Test Time Scaling for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14382v1">http://arxiv.org/abs/2502.14382v1</a></p>

            <p><strong>Abstract:</strong><br>
            Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            S*: Test Time Scaling for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14382v1">http://arxiv.org/abs/2502.14382v1</a></p>

            <p><strong>Abstract:</strong><br>
            Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:48:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46112506/16149e3d.mp3" length="23196796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1446</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            S*: Test Time Scaling for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14382v1">http://arxiv.org/abs/2502.14382v1</a></p>

            <p><strong>Abstract:</strong><br>
            Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning</title>
      <itunes:episode>599</itunes:episode>
      <podcast:episode>599</podcast:episode>
      <itunes:title>Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8193bb54-4ab9-40e8-8d1a-791252999b45</guid>
      <link>https://share.transistor.fm/s/b2fcadd1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo</p>

            <p><strong>Title:</strong><br>
            Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14768v1">http://arxiv.org/abs/2502.14768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo</p>

            <p><strong>Title:</strong><br>
            Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14768v1">http://arxiv.org/abs/2502.14768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:47:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2fcadd1/52d1f21a.mp3" length="23919479" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1491</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, Chong Luo</p>

            <p><strong>Title:</strong><br>
            Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14768v1">http://arxiv.org/abs/2502.14768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning</title>
      <itunes:episode>598</itunes:episode>
      <podcast:episode>598</podcast:episode>
      <itunes:title>Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7e144609-28d1-4ba0-b5f5-77540239d8ab</guid>
      <link>https://share.transistor.fm/s/1e382198</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | quant-ph, cs.AI, cs.IT, cs.LG, math.IT</p>

            <p><strong>Authors:</strong><br>
            Austin Yubo He, Zi-Wen Liu</p>

            <p><strong>Title:</strong><br>
            Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14372v1">http://arxiv.org/abs/2502.14372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The realization of scalable fault-tolerant quantum computing is expected to hinge on quantum error-correcting codes. In the quest for more efficient quantum fault tolerance, a critical code parameter is the weight of measurements that extract information about errors to enable error correction: as higher measurement weights require higher implementation costs and introduce more errors, it is important in code design to optimize measurement weight. This underlies the surging interest in quantum low-density parity-check (qLDPC) codes, the study of which has primarily focused on the asymptotic (large-code-limit) properties. In this work, we introduce a versatile and computationally efficient approach to stabilizer code weight reduction based on reinforcement learning (RL), which produces new low-weight codes that substantially outperform the state of the art in practically relevant parameter regimes, extending significantly beyond previously accessible small distances. For example, our approach demonstrates savings in physical qubit overhead compared to existing results by 1 to 2 orders of magnitude for weight 6 codes and brings the overhead into a feasible range for near-future experiments. We also investigate the interplay between code parameters using our RL framework, offering new insights into the potential efficiency and power of practically viable coding strategies. Overall, our results demonstrate how RL can effectively advance the crucial yet challenging problem of quantum code discovery and thereby facilitate a faster path to the practical implementation of fault-tolerant quantum technologies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | quant-ph, cs.AI, cs.IT, cs.LG, math.IT</p>

            <p><strong>Authors:</strong><br>
            Austin Yubo He, Zi-Wen Liu</p>

            <p><strong>Title:</strong><br>
            Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14372v1">http://arxiv.org/abs/2502.14372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The realization of scalable fault-tolerant quantum computing is expected to hinge on quantum error-correcting codes. In the quest for more efficient quantum fault tolerance, a critical code parameter is the weight of measurements that extract information about errors to enable error correction: as higher measurement weights require higher implementation costs and introduce more errors, it is important in code design to optimize measurement weight. This underlies the surging interest in quantum low-density parity-check (qLDPC) codes, the study of which has primarily focused on the asymptotic (large-code-limit) properties. In this work, we introduce a versatile and computationally efficient approach to stabilizer code weight reduction based on reinforcement learning (RL), which produces new low-weight codes that substantially outperform the state of the art in practically relevant parameter regimes, extending significantly beyond previously accessible small distances. For example, our approach demonstrates savings in physical qubit overhead compared to existing results by 1 to 2 orders of magnitude for weight 6 codes and brings the overhead into a feasible range for near-future experiments. We also investigate the interplay between code parameters using our RL framework, offering new insights into the potential efficiency and power of practically viable coding strategies. Overall, our results demonstrate how RL can effectively advance the crucial yet challenging problem of quantum code discovery and thereby facilitate a faster path to the practical implementation of fault-tolerant quantum technologies.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:47:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1e382198/dccc0445.mp3" length="20682410" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | quant-ph, cs.AI, cs.IT, cs.LG, math.IT</p>

            <p><strong>Authors:</strong><br>
            Austin Yubo He, Zi-Wen Liu</p>

            <p><strong>Title:</strong><br>
            Discovering highly efficient low-weight quantum error-correcting codes with reinforcement learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14372v1">http://arxiv.org/abs/2502.14372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The realization of scalable fault-tolerant quantum computing is expected to hinge on quantum error-correcting codes. In the quest for more efficient quantum fault tolerance, a critical code parameter is the weight of measurements that extract information about errors to enable error correction: as higher measurement weights require higher implementation costs and introduce more errors, it is important in code design to optimize measurement weight. This underlies the surging interest in quantum low-density parity-check (qLDPC) codes, the study of which has primarily focused on the asymptotic (large-code-limit) properties. In this work, we introduce a versatile and computationally efficient approach to stabilizer code weight reduction based on reinforcement learning (RL), which produces new low-weight codes that substantially outperform the state of the art in practically relevant parameter regimes, extending significantly beyond previously accessible small distances. For example, our approach demonstrates savings in physical qubit overhead compared to existing results by 1 to 2 orders of magnitude for weight 6 codes and brings the overhead into a feasible range for near-future experiments. We also investigate the interplay between code parameters using our RL framework, offering new insights into the potential efficiency and power of practically viable coding strategies. Overall, our results demonstrate how RL can effectively advance the crucial yet challenging problem of quantum code discovery and thereby facilitate a faster path to the practical implementation of fault-tolerant quantum technologies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models</title>
      <itunes:episode>597</itunes:episode>
      <podcast:episode>597</podcast:episode>
      <itunes:title>LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e5ea4e6e-1f24-4ea4-9277-de72e57107f3</guid>
      <link>https://share.transistor.fm/s/1f92fd02</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14834v1">http://arxiv.org/abs/2502.14834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14834v1">http://arxiv.org/abs/2502.14834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:47:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1f92fd02/45a1f735.mp3" length="18991755" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1183</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14834v1">http://arxiv.org/abs/2502.14834v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information</title>
      <itunes:episode>596</itunes:episode>
      <podcast:episode>596</podcast:episode>
      <itunes:title>Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f4b2dee-cfed-45c3-bcb6-10bc27920b75</guid>
      <link>https://share.transistor.fm/s/893529d2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yein Park, Chanwoong Yoon, Jungwoo Park, Minbyul Jeong, Jaewoo Kang</p>

            <p><strong>Title:</strong><br>
            Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14258v1">http://arxiv.org/abs/2502.14258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. Through circuit analysis, we discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while leaving its general capabilities intact, without compromising time-invariant knowledge or question-answering performance. Moreover, the heads are activated not only by numeric conditions ("In 2004") but also by textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yein Park, Chanwoong Yoon, Jungwoo Park, Minbyul Jeong, Jaewoo Kang</p>

            <p><strong>Title:</strong><br>
            Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14258v1">http://arxiv.org/abs/2502.14258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. Through circuit analysis, we discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while leaving its general capabilities intact, without compromising time-invariant knowledge or question-answering performance. Moreover, the heads are activated not only by numeric conditions ("In 2004") but also by textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:46:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/893529d2/83c1eb83.mp3" length="22377233" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yein Park, Chanwoong Yoon, Jungwoo Park, Minbyul Jeong, Jaewoo Kang</p>

            <p><strong>Title:</strong><br>
            Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.14258v1">http://arxiv.org/abs/2502.14258v1</a></p>

            <p><strong>Abstract:</strong><br>
            While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. Through circuit analysis, we discover Temporal Heads, specific attention heads primarily responsible for processing temporal knowledge. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model's ability to recall time-specific knowledge while leaving its general capabilities intact, without compromising time-invariant knowledge or question-answering performance. Moreover, the heads are activated not only by numeric conditions ("In 2004") but also by textual aliases ("In the year ..."), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning</title>
      <itunes:episode>595</itunes:episode>
      <podcast:episode>595</podcast:episode>
      <itunes:title>S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4541415d-1d2a-4c66-9a11-fd9aec8f063e</guid>
      <link>https://share.transistor.fm/s/89a37d04</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li</p>

            <p><strong>Title:</strong><br>
            S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12853v1">http://arxiv.org/abs/2502.12853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li</p>

            <p><strong>Title:</strong><br>
            S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12853v1">http://arxiv.org/abs/2502.12853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 21 Feb 2025 20:46:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/89a37d04/54a77182.mp3" length="22516815" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li</p>

            <p><strong>Title:</strong><br>
            S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12853v1">http://arxiv.org/abs/2502.12853v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen2.5-VL Technical Report</title>
      <itunes:episode>594</itunes:episode>
      <podcast:episode>594</podcast:episode>
      <itunes:title>Qwen2.5-VL Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">24c1aeb2-1484-43bf-961a-91c38db32860</guid>
      <link>https://share.transistor.fm/s/65a12943</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13923v1">http://arxiv.org/abs/2502.13923v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13923v1">http://arxiv.org/abs/2502.13923v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:44:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65a12943/821c5970.mp3" length="20401471" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1271</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 97 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-VL Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13923v1">http://arxiv.org/abs/2502.13923v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning</title>
      <itunes:episode>593</itunes:episode>
      <podcast:episode>593</podcast:episode>
      <itunes:title>RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">394f237e-9d55-49b5-a4e2-d5fdb3046d14</guid>
      <link>https://share.transistor.fm/s/441714a4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13144v1">http://arxiv.org/abs/2502.13144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially a 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13144v1">http://arxiv.org/abs/2502.13144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially a 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:44:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/441714a4/da085b1a.mp3" length="19977307" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1245</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13144v1">http://arxiv.org/abs/2502.13144v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially a 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation</title>
      <itunes:episode>592</itunes:episode>
      <podcast:episode>592</podcast:episode>
      <itunes:title>SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c11e723e-df46-4fab-94be-87b5630b51b9</guid>
      <link>https://share.transistor.fm/s/4eb18382</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.SD, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13128v1">http://arxiv.org/abs/2502.13128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.SD, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13128v1">http://arxiv.org/abs/2502.13128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:44:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4eb18382/60f6adaa.mp3" length="23499854" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1465</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.SD, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13128v1">http://arxiv.org/abs/2502.13128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MoM: Linear Sequence Modeling with Mixture-of-Memories</title>
      <itunes:episode>591</itunes:episode>
      <podcast:episode>591</podcast:episode>
      <itunes:title>MoM: Linear Sequence Modeling with Mixture-of-Memories</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1c926300-186f-41a5-af70-967c5f2ca279</guid>
      <link>https://share.transistor.fm/s/ac14054f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            MoM: Linear Sequence Modeling with Mixture-of-Memories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13685v1">http://arxiv.org/abs/2502.13685v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            MoM: Linear Sequence Modeling with Mixture-of-Memories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13685v1">http://arxiv.org/abs/2502.13685v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:43:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ac14054f/2f183e90.mp3" length="19456492" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1212</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            MoM: Linear Sequence Modeling with Mixture-of-Memories</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13685v1">http://arxiv.org/abs/2502.13685v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while maintaining constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering</title>
      <itunes:episode>590</itunes:episode>
      <podcast:episode>590</podcast:episode>
      <itunes:title>Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0c3ea810-406a-47a4-a828-21a5f482cea6</guid>
      <link>https://share.transistor.fm/s/0c61c172</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            William Jurayj, Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13962v1">http://arxiv.org/abs/2502.13962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            William Jurayj, Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13962v1">http://arxiv.org/abs/2502.13962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:43:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c61c172/bc4471b2.mp3" length="20615521" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1285</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            William Jurayj, Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13962v1">http://arxiv.org/abs/2502.13962v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Craw4LLM: Efficient Web Crawling for LLM Pretraining</title>
      <itunes:episode>589</itunes:episode>
      <podcast:episode>589</podcast:episode>
      <itunes:title>Craw4LLM: Efficient Web Crawling for LLM Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2dd6055c-11dc-4063-8c18-398f2a5dc9e0</guid>
      <link>https://share.transistor.fm/s/45f6eb15</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shi Yu, Zhiyuan Liu, Chenyan Xiong</p>

            <p><strong>Title:</strong><br>
            Craw4LLM: Efficient Web Crawling for LLM Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13347v1">http://arxiv.org/abs/2502.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Web crawls are a main source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph according to the preferences of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performance as previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shi Yu, Zhiyuan Liu, Chenyan Xiong</p>

            <p><strong>Title:</strong><br>
            Craw4LLM: Efficient Web Crawling for LLM Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13347v1">http://arxiv.org/abs/2502.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Web crawls are a main source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph according to the preferences of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performance as previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:42:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/45f6eb15/54e3cbe2.mp3" length="21879817" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1364</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shi Yu, Zhiyuan Liu, Chenyan Xiong</p>

            <p><strong>Title:</strong><br>
            Craw4LLM: Efficient Web Crawling for LLM Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13347v1">http://arxiv.org/abs/2502.13347v1</a></p>

            <p><strong>Abstract:</strong><br>
            Web crawls are a main source of pretraining data for large language models (LLMs), but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Crawl4LLM, an efficient web crawling method that explores the web graph according to the preferences of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Crawl4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Crawl4LLM data reach the same downstream performance as previous crawls, significantly reducing crawling waste and alleviating the burden on websites. Our code is publicly available at https://github.com/cxcscmu/Crawl4LLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization</title>
      <itunes:episode>588</itunes:episode>
      <podcast:episode>588</podcast:episode>
      <itunes:title>LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0a2bab5e-1f7c-4381-bb27-e97eb6b52d8d</guid>
      <link>https://share.transistor.fm/s/a5a6e43c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13922v2">http://arxiv.org/abs/2502.13922v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference data reveals the capabilities and potential of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13922v2">http://arxiv.org/abs/2502.13922v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference data reveals the capabilities and potential of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:42:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5a6e43c/e2054c08.mp3" length="24319081" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1516</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13922v2">http://arxiv.org/abs/2502.13922v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference data reveals the capabilities and potential of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Small Models Struggle to Learn from Strong Reasoners</title>
      <itunes:episode>587</itunes:episode>
      <podcast:episode>587</podcast:episode>
      <itunes:title>Small Models Struggle to Learn from Strong Reasoners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c85fa98-8e84-4230-91d4-513baf6acf62</guid>
      <link>https://share.transistor.fm/s/70397506</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Small Models Struggle to Learn from Strong Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12143v1">http://arxiv.org/abs/2502.12143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Small Models Struggle to Learn from Strong Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12143v1">http://arxiv.org/abs/2502.12143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:42:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/70397506/37e01c97.mp3" length="19289306" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1202</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Small Models Struggle to Learn from Strong Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12143v1">http://arxiv.org/abs/2502.12143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Autellix: An Efficient Serving Engine for LLM Agents as General Programs</title>
      <itunes:episode>586</itunes:episode>
      <podcast:episode>586</podcast:episode>
      <itunes:title>Autellix: An Efficient Serving Engine for LLM Agents as General Programs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d20f59e5-8966-4c29-a261-d897f14d456f</guid>
      <link>https://share.transistor.fm/s/bd08dbcf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Autellix: An Efficient Serving Engine for LLM Agents as General Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13965v1">http://arxiv.org/abs/2502.13965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program levels. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, one for single-threaded and one for distributed programs, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Autellix: An Efficient Serving Engine for LLM Agents as General Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13965v1">http://arxiv.org/abs/2502.13965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program levels. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, one for single-threaded and one for distributed programs, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:41:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd08dbcf/eafbcbdb.mp3" length="21657065" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC</p>

            <p><strong>Authors:</strong><br>
            Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            Autellix: An Efficient Serving Engine for LLM Agents as General Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13965v1">http://arxiv.org/abs/2502.13965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program levels. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms, one for single-threaded and one for distributed programs, that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves the throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?</title>
      <itunes:episode>585</itunes:episode>
      <podcast:episode>585</podcast:episode>
      <itunes:title>SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">010a04e7-3b75-40f0-90dd-c6717ef4e12a</guid>
      <link>https://share.transistor.fm/s/8c24e2c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.AI, cs.IR, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13233v1">http://arxiv.org/abs/2502.13233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.AI, cs.IR, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13233v1">http://arxiv.org/abs/2502.13233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 20 Feb 2025 20:41:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8c24e2c1/abc79a24.mp3" length="21249983" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1324</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.AI, cs.IR, cs.IT, math.IT</p>

            <p><strong>Authors:</strong><br>
            Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu</p>

            <p><strong>Title:</strong><br>
            SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13233v1">http://arxiv.org/abs/2502.13233v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Soundwave: Less is More for Speech-Text Alignment in LLMs</title>
      <itunes:episode>584</itunes:episode>
      <podcast:episode>584</podcast:episode>
      <itunes:title>Soundwave: Less is More for Speech-Text Alignment in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55a270f1-3544-48dd-999e-810321ad1bbf</guid>
      <link>https://share.transistor.fm/s/ea8891ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            Soundwave: Less is More for Speech-Text Alignment in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12900v1">http://arxiv.org/abs/2502.12900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            Soundwave: Less is More for Speech-Text Alignment in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12900v1">http://arxiv.org/abs/2502.12900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:46:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ea8891ee/678a2615.mp3" length="21051009" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CL, cs.AI, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li</p>

            <p><strong>Title:</strong><br>
            Soundwave: Less is More for Speech-Text Alignment in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12900v1">http://arxiv.org/abs/2502.12900v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity</title>
      <itunes:episode>583</itunes:episode>
      <podcast:episode>583</podcast:episode>
      <itunes:title>Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d8f4aa7c-9134-438f-8239-2d27a64acc64</guid>
      <link>https://share.transistor.fm/s/f6fda3bc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13063v1">http://arxiv.org/abs/2502.13063v1</a></p>

            <p><strong>Abstract:</strong><br>
            A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13063v1">http://arxiv.org/abs/2502.13063v1</a></p>

            <p><strong>Abstract:</strong><br>
            A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:46:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f6fda3bc/14cd7ca2.mp3" length="22267320" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1388</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13063v1">http://arxiv.org/abs/2502.13063v1</a></p>

            <p><strong>Abstract:</strong><br>
            A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or a key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Continuous Diffusion Model for Language Modeling</title>
      <itunes:episode>582</itunes:episode>
      <podcast:episode>582</podcast:episode>
      <itunes:title>Continuous Diffusion Model for Language Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f060fe4-d511-41f8-b276-9e3c0ae8f2aa</guid>
      <link>https://share.transistor.fm/s/d290a026</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaehyeong Jo, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Continuous Diffusion Model for Language Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11564v1">http://arxiv.org/abs/2502.11564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that work directly in discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between discrete diffusion and continuous flow on the statistical manifold, and, building on this analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaehyeong Jo, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Continuous Diffusion Model for Language Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11564v1">http://arxiv.org/abs/2502.11564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that work directly in discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between discrete diffusion and continuous flow on the statistical manifold, and, building on this analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:45:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d290a026/1bee4de8.mp3" length="18021214" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1123</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jaehyeong Jo, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            Continuous Diffusion Model for Language Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11564v1">http://arxiv.org/abs/2502.11564v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that work directly in discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between discrete diffusion and continuous flow on the statistical manifold, and, building on this analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at https://github.com/harryjo97/RDLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Phantom: Subject-consistent video generation via cross-modal alignment</title>
      <itunes:episode>581</itunes:episode>
      <podcast:episode>581</podcast:episode>
      <itunes:title>Phantom: Subject-consistent video generation via cross-modal alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c2bef3fd-fe28-4fbe-8e13-a83c24ac48e2</guid>
      <link>https://share.transistor.fm/s/98caa16b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            Phantom: Subject-consistent video generation via cross-modal alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11079v1">http://arxiv.org/abs/2502.11079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models for video generation are continuously evolving into various applications, while subject-consistent video generation is still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            Phantom: Subject-consistent video generation via cross-modal alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11079v1">http://arxiv.org/abs/2502.11079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models for video generation are continuously evolving into various applications, while subject-consistent video generation is still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:45:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/98caa16b/c2c4b602.mp3" length="20500152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1278</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu</p>

            <p><strong>Title:</strong><br>
            Phantom: Subject-consistent video generation via cross-modal alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11079v1">http://arxiv.org/abs/2502.11079v1</a></p>

            <p><strong>Abstract:</strong><br>
            Foundation models for video generation are continuously evolving into various applications, while subject-consistent video generation is still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent video through textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Diverse Human Preference Learning through Principal Component Analysis</title>
      <itunes:episode>580</itunes:episode>
      <podcast:episode>580</podcast:episode>
      <itunes:title>Rethinking Diverse Human Preference Learning through Principal Component Analysis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9ee34ff-e890-4beb-9a70-7645e3cd1f65</guid>
      <link>https://share.transistor.fm/s/ffe272d1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Diverse Human Preference Learning through Principal Component Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13131v1">http://arxiv.org/abs/2502.13131v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Diverse Human Preference Learning through Principal Component Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13131v1">http://arxiv.org/abs/2502.13131v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:45:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ffe272d1/45b5302a.mp3" length="21803778" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1359</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen</p>

            <p><strong>Title:</strong><br>
            Rethinking Diverse Human Preference Learning through Principal Component Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13131v1">http://arxiv.org/abs/2502.13131v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Magma: A Foundation Model for Multimodal AI Agents</title>
      <itunes:episode>579</itunes:episode>
      <podcast:episode>579</podcast:episode>
      <itunes:title>Magma: A Foundation Model for Multimodal AI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">99f2868b-1b9e-4357-a515-e3699e53c1cc</guid>
      <link>https://share.transistor.fm/s/99f3c88c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            Magma: A Foundation Model for Multimodal AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13130v1">http://arxiv.org/abs/2502.13130v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the traces of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM achieve great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks, as shown in Fig. 1. In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image- and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            Magma: A Foundation Model for Multimodal AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13130v1">http://arxiv.org/abs/2502.13130v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the traces of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM achieve great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks, as shown in Fig. 1. In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image- and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:44:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/99f3c88c/d0fababb.mp3" length="22174477" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.HC, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao</p>

            <p><strong>Title:</strong><br>
            Magma: A Foundation Model for Multimodal AI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13130v1">http://arxiv.org/abs/2502.13130v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow these agentic capabilities, Magma is pretrained on large amounts of heterogeneous data spanning images, videos, and robotics data, where the actionable visual objects (e.g., clickable buttons in a GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the traces of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM achieve great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks, as shown in Fig. 1. In particular, Magma sets new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image- and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation</title>
      <itunes:episode>578</itunes:episode>
      <podcast:episode>578</podcast:episode>
      <itunes:title>Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">329610ad-ead5-4557-9abc-e7763e7e25c3</guid>
      <link>https://share.transistor.fm/s/a556214e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13145v1">http://arxiv.org/abs/2502.13145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from the trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear- and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6x speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5x speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13145v1">http://arxiv.org/abs/2502.13145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from the trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear- and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6x speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5x speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:44:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a556214e/1d45e7e1.mp3" length="21049376" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13145v1">http://arxiv.org/abs/2502.13145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from the trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear- and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6x speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5x speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation</title>
      <itunes:episode>577</itunes:episode>
      <podcast:episode>577</podcast:episode>
      <itunes:title>SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9c113d96-3d91-4e17-94ab-f0f0ea6bbf7d</guid>
      <link>https://share.transistor.fm/s/4791f037</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi</p>

            <p><strong>Title:</strong><br>
            SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13143v1">http://arxiv.org/abs/2502.13143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi</p>

            <p><strong>Title:</strong><br>
            SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13143v1">http://arxiv.org/abs/2502.13143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:43:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4791f037/2d3d506e.mp3" length="20687414" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi</p>

            <p><strong>Title:</strong><br>
            SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.13143v1">http://arxiv.org/abs/2502.13143v1</a></p>

            <p><strong>Abstract:</strong><br>
            Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models</title>
      <itunes:episode>576</itunes:episode>
      <podcast:episode>576</podcast:episode>
      <itunes:title>SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b5849e0a-d2e8-49f9-b210-ad909602e9f2</guid>
      <link>https://share.transistor.fm/s/35c05df7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12464v1">http://arxiv.org/abs/2502.12464v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12464v1">http://arxiv.org/abs/2502.12464v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:43:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/35c05df7/70b103f6.mp3" length="19769176" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seanie Lee, Dong Bok Lee, Dominik Wagner, Minki Kang, Haebin Seong, Tobias Bocklet, Juho Lee, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12464v1">http://arxiv.org/abs/2502.12464v1</a></p>

            <p><strong>Abstract:</strong><br>
            Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on "hard" examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model's capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>You Do Not Fully Utilize Transformer's Representation Capacity</title>
      <itunes:episode>575</itunes:episode>
      <podcast:episode>575</podcast:episode>
      <itunes:title>You Do Not Fully Utilize Transformer's Representation Capacity</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c33b50df-8196-4a7d-b398-1e53c2a17421</guid>
      <link>https://share.transistor.fm/s/3db10eae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            You Do Not Fully Utilize Transformer's Representation Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09245v1">http://arxiv.org/abs/2502.09245v1</a></p>

            <p><strong>Abstract:</strong><br>
            In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.</p>
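
            <p><strong>Illustrative code sketch:</strong><br>
            A toy sketch of the core idea, assuming a learned softmax mixture over earlier layers' hidden states; the module name, tensor shapes, and mixing scheme are simplifications, not the paper's actual LIMe routing or lookup mechanisms.</p>

            <pre><code>
import torch

class EarlierLayerMixture(torch.nn.Module):
    """Toy stand-in for layer-integrated memory: mix all earlier layers' states."""

    def __init__(self, num_earlier_layers):
        super().__init__()
        # One learnable logit per earlier layer, normalized with a softmax.
        self.logits = torch.nn.Parameter(torch.zeros(num_earlier_layers))

    def forward(self, earlier_states):
        # earlier_states: list of [batch, seq, dim] tensors, one per earlier layer.
        stacked = torch.stack(earlier_states, dim=0)        # [layers, batch, seq, dim]
        weights = torch.softmax(self.logits, dim=0)         # [layers]
        # Weighted sum over the layer axis replaces "previous layer only".
        return torch.einsum("l,lbsd->bsd", weights, stacked)
</code></pre>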
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            You Do Not Fully Utilize Transformer's Representation Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09245v1">http://arxiv.org/abs/2502.09245v1</a></p>

            <p><strong>Abstract:</strong><br>
            In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 19 Feb 2025 20:43:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3db10eae/704e7340.mp3" length="20156582" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1256</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            You Do Not Fully Utilize Transformer's Representation Capacity</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09245v1">http://arxiv.org/abs/2502.09245v1</a></p>

            <p><strong>Abstract:</strong><br>
            In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</title>
      <itunes:episode>574</itunes:episode>
      <podcast:episode>574</podcast:episode>
      <itunes:title>Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9eb6eb9f-3e3c-4e4a-ab5b-f61b3d654f3c</guid>
      <link>https://share.transistor.fm/s/658ef8d4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng</p>

            <p><strong>Title:</strong><br>
            Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11089v1">http://arxiv.org/abs/2502.11089v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.</p>
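
            <p><strong>Illustrative code sketch:</strong><br>
            A toy sketch of the coarse-to-fine selection described above, not the hardware-aligned NSA kernel: pooled block summaries score key/value blocks, the top-k blocks are kept, and attention runs only inside them. The block size, k, shapes, and the mean-pooling compressor are illustration-only assumptions.</p>

            <pre><code>
import torch

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    # q: [heads, dim]; k, v: [heads, seq, dim]; seq assumed divisible by block_size.
    heads, seq, dim = k.shape
    # Coarse stage: compress each block of keys into one summary vector (mean pool).
    block_keys = k.view(heads, seq // block_size, block_size, dim).mean(dim=2)
    block_scores = torch.einsum("hd,hnd->hn", q, block_keys)
    # Fine stage: keep only the top-k highest-scoring blocks per head.
    keep = block_scores.topk(top_k, dim=-1).indices
    out = torch.zeros(heads, dim)
    for h in range(heads):
        idx = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                         for b in keep[h].tolist()])
        scores = torch.einsum("d,sd->s", q[h], k[h, idx]) / dim ** 0.5
        out[h] = torch.einsum("s,sd->d", torch.softmax(scores, dim=-1), v[h, idx])
    return out
</code></pre>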
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng</p>

            <p><strong>Title:</strong><br>
            Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11089v1">http://arxiv.org/abs/2502.11089v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:51:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/658ef8d4/2cf40d70.mp3" length="22194152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1383</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng</p>

            <p><strong>Title:</strong><br>
            Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11089v1">http://arxiv.org/abs/2502.11089v1</a></p>

            <p><strong>Abstract:</strong><br>
            Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning Getting-Up Policies for Real-World Humanoid Robots</title>
      <itunes:episode>573</itunes:episode>
      <podcast:episode>573</podcast:episode>
      <itunes:title>Learning Getting-Up Policies for Real-World Humanoid Robots</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">282ba0c8-07e7-4f33-b27f-5158ce5ef224</guid>
      <link>https://share.transistor.fm/s/fb5847aa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta</p>

            <p><strong>Title:</strong><br>
            Learning Getting-Up Policies for Real-World Humanoid Robots</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12152v1">http://arxiv.org/abs/2502.12152v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitates accurately modeling the collision geometry and sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world. Project page: https://humanoid-getup.github.io/</p>
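
            <p><strong>Illustrative code sketch:</strong><br>
            A hypothetical configuration sketch of the two-stage discover-then-refine curriculum described above; every field name, weight, and terrain list below is an illustrative placeholder rather than a value from the paper.</p>

            <pre><code>
# Stage 1: discover any successful getting-up trajectory under minimal constraints.
STAGE_1_DISCOVER = {
    "smoothness_penalty": 0.0,      # no smoothness or speed/torque pressure yet
    "torque_limit_scale": 1.0,
    "randomize_initial_pose": False,
    "terrains": ["flat"],
}

# Stage 2: refine the discovered motion into a slow, smooth, deployable policy
# that is robust to varied initial configurations and terrains.
STAGE_2_REFINE = {
    "smoothness_penalty": 1.0,
    "torque_limit_scale": 0.6,
    "randomize_initial_pose": True,
    "terrains": ["flat", "deformable", "slippery", "slope"],
}
</code></pre>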
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta</p>

            <p><strong>Title:</strong><br>
            Learning Getting-Up Policies for Real-World Humanoid Robots</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12152v1">http://arxiv.org/abs/2502.12152v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitates accurately modeling the collision geometry and sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world. Project page: https://humanoid-getup.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:51:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb5847aa/5569319c.mp3" length="23735563" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1480</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xialin He, Runpei Dong, Zixuan Chen, Saurabh Gupta</p>

            <p><strong>Title:</strong><br>
            Learning Getting-Up Policies for Real-World Humanoid Robots</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12152v1">http://arxiv.org/abs/2502.12152v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of humanoid locomotion learning, the getting-up task involves complex contact patterns, which necessitates accurately modeling the collision geometry and sparser rewards. We address these challenges through a two-phase approach that follows a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). To the best of our knowledge, this is the first successful demonstration of learned getting-up policies for human-sized humanoid robots in the real world. Project page: https://humanoid-getup.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?</title>
      <itunes:episode>572</itunes:episode>
      <podcast:episode>572</podcast:episode>
      <itunes:title>SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">250a6632-bd2f-462d-90e9-05f24da81056</guid>
      <link>https://share.transistor.fm/s/46750c62</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke</p>

            <p><strong>Title:</strong><br>
            SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12115v1">http://arxiv.org/abs/2502.12115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.</p>
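
            <p><strong>Illustrative code sketch:</strong><br>
            A hedged sketch of payout-weighted scoring in the spirit of this benchmark, not OpenAI's evaluation harness: independent tasks count their dollar value when end-to-end tests pass, and managerial tasks count when the model matches the original manager's choice. All field and method names are assumptions.</p>

            <pre><code>
def score_swe_lancer(tasks, model):
    """Toy payout-weighted scorer; the `tasks` and `model` interfaces are hypothetical."""
    earned, total = 0.0, 0.0
    for task in tasks:
        total += task["payout_usd"]
        if task["kind"] == "independent":
            patch = model.solve(task["repo"], task["issue"])
            passed = task["run_e2e_tests"](patch)   # end-to-end tests decide the payout
        else:  # managerial: choose among technical implementation proposals
            choice = model.choose(task["proposals"])
            passed = choice == task["manager_choice"]
        if passed:
            earned += task["payout_usd"]
    return earned, total
</code></pre>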
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke</p>

            <p><strong>Title:</strong><br>
            SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12115v1">http://arxiv.org/abs/2502.12115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:50:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46750c62/fafcada6.mp3" length="21073197" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1313</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke</p>

            <p><strong>Title:</strong><br>
            SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12115v1">http://arxiv.org/abs/2502.12115v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from $50 bug fixes to $32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CRANE: Reasoning with constrained LLM generation</title>
      <itunes:episode>571</itunes:episode>
      <podcast:episode>571</podcast:episode>
      <itunes:title>CRANE: Reasoning with constrained LLM generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">befc98b8-316b-4b9e-b108-8bde822d1aca</guid>
      <link>https://share.transistor.fm/s/a9c5793e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.PL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh</p>

            <p><strong>Title:</strong><br>
            CRANE: Reasoning with constrained LLM generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09061v1">http://arxiv.org/abs/2502.09061v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to a 10 percentage point accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.</p>
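
            <p><strong>Illustrative code sketch:</strong><br>
            A simplified sketch of reasoning-augmented constrained decoding: generation stays unconstrained during a free-form reasoning span, and the grammar mask is applied only after the model emits an answer delimiter. The delimiter, the grammar/model interfaces, and the stopping logic are assumptions, not CRANE's actual augmented grammar rules.</p>

            <pre><code>
def reasoning_then_constrained_decode(model, prompt, grammar,
                                      answer_tag="FINAL:", max_tokens=512):
    # Phase 1: unconstrained chain-of-thought until the model signals its final answer.
    text = prompt
    for _ in range(max_tokens):
        text += model.sample_next(text)          # hypothetical sampling interface
        if text.endswith(answer_tag):
            break
    # Phase 2: constrained decoding of the answer itself, so it stays grammar-valid.
    answer = ""
    while not grammar.is_complete(answer):
        allowed = grammar.allowed_next_tokens(answer)   # grammar mask (assumed API)
        answer += model.sample_next(text + answer, allowed_tokens=allowed)
    return answer
</code></pre>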
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.PL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh</p>

            <p><strong>Title:</strong><br>
            CRANE: Reasoning with constrained LLM generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09061v1">http://arxiv.org/abs/2502.09061v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to a 10 percentage point accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:50:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9c5793e/d5c2165b.mp3" length="20580796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.PL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh</p>

            <p><strong>Title:</strong><br>
            CRANE: Reasoning with constrained LLM generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09061v1">http://arxiv.org/abs/2502.09061v1</a></p>

            <p><strong>Abstract:</strong><br>
            Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to a 10 percentage point accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training</title>
      <itunes:episode>570</itunes:episode>
      <podcast:episode>570</podcast:episode>
      <itunes:title>How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9d47431c-879e-4aa5-af18-ab1548c86e36</guid>
      <link>https://share.transistor.fm/s/c167afae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11196v1">http://arxiv.org/abs/2502.11196v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the exceptional capabilities of Large Language Models (LLMs) in knowledge-intensive tasks, there is a critical gap in understanding how they internalize new knowledge, particularly how acquired knowledge becomes structurally embedded in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also have potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11196v1">http://arxiv.org/abs/2502.11196v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the exceptional capabilities of Large Language Models (LLMs) in knowledge-intensive tasks, there is a critical gap in understanding how they internalize new knowledge, particularly how acquired knowledge becomes structurally embedded in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also have potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:49:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c167afae/264bbfa1.mp3" length="23719715" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1479</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11196v1">http://arxiv.org/abs/2502.11196v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the exceptional capabilities of Large Language Models (LLMs) in knowledge-intensive tasks, there is a critical gap in understanding how they internalize new knowledge, particularly how acquired knowledge becomes structurally embedded in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also have potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation</title>
      <itunes:episode>569</itunes:episode>
      <podcast:episode>569</podcast:episode>
      <itunes:title>HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03fca239-a143-4025-a510-cb20bc3ab428</guid>
      <link>https://share.transistor.fm/s/31d57e6b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui</p>

            <p><strong>Title:</strong><br>
            HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12148v1">http://arxiv.org/abs/2502.12148v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow</p>
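
            <p><strong>Illustrative code sketch:</strong><br>
            A rough sketch of how paired preference terms over understanding and generation could be combined, assuming standard DPO objectives; the weighting, pairing, and self-play loop are simplifications, and the function names are placeholders rather than the paper's Pair-DPO formulation.</p>

            <pre><code>
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective on one preference pair
    # (inputs are per-sequence summed log-probabilities as tensors).
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)

def pair_dpo_loss(understanding_pair, generation_pair, weight=0.5):
    # Hypothetical combination: one DPO term on an understanding preference pair
    # and one on the homologous generation pair; the 50/50 weighting is an assumption.
    return weight * dpo_term(*understanding_pair) + (1.0 - weight) * dpo_term(*generation_pair)
</code></pre>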
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui</p>

            <p><strong>Title:</strong><br>
            HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12148v1">http://arxiv.org/abs/2502.12148v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:49:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/31d57e6b/3590810d.mp3" length="19279722" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1201</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui</p>

            <p><strong>Title:</strong><br>
            HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.12148v1">http://arxiv.org/abs/2502.12148v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models</title>
      <itunes:episode>568</itunes:episode>
      <podcast:episode>568</podcast:episode>
      <itunes:title>I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">08095968-f428-4013-8aab-02984770a4b0</guid>
      <link>https://share.transistor.fm/s/27eba097</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu</p>

            <p><strong>Title:</strong><br>
            I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10458v1">http://arxiv.org/abs/2502.10458v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the <strong>LLM decoder</strong> shares the same input feature space with <strong>diffusion decoders</strong> that use the corresponding <strong>LLM encoder</strong> for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu</p>

            <p><strong>Title:</strong><br>
            I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10458v1">http://arxiv.org/abs/2502.10458v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the <strong>LLM decoder</strong> shares the same input feature space with <strong>diffusion decoders</strong> that use the corresponding <strong>LLM encoder</strong> for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:48:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/27eba097/ddef6050.mp3" length="21554265" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu</p>

            <p><strong>Title:</strong><br>
            I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10458v1">http://arxiv.org/abs/2502.10458v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the <strong>LLM decoder</strong> shares the same input feature space with <strong>diffusion decoders</strong> that use the corresponding <strong>LLM encoder</strong> for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors</title>
      <itunes:episode>567</itunes:episode>
      <podcast:episode>567</podcast:episode>
      <itunes:title>SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bb1e24bb-d217-466e-9571-103ca328af56</guid>
      <link>https://share.transistor.fm/s/92f9c4a9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Siqiao Huang, Zichen Liang</p>

            <p><strong>Title:</strong><br>
            SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11167v1">http://arxiv.org/abs/2502.11167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.</p>
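
            <p><strong>Illustrative code sketch:</strong><br>
            A minimal sketch of what surrogate code execution means in practice, assuming a plain-text llm callable: the model predicts a program's stdout, which is then compared with the real execution. The prompt format and interfaces are placeholders, and the benchmark's eight task categories are far broader than this toy check.</p>

            <pre><code>
import subprocess
import sys

def surrogate_vs_real(llm, program_source, stdin_text=""):
    # Surrogate execution: the model predicts the program's output without running it.
    predicted = llm("Predict the exact stdout of this Python program.\n"
                    "Program:\n" + program_source + "\nStdin:\n" + stdin_text)
    # Ground truth: actually run the program and capture its stdout.
    actual = subprocess.run([sys.executable, "-c", program_source],
                            input=stdin_text, capture_output=True, text=True).stdout
    return predicted.strip() == actual.strip()
</code></pre>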
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Siqiao Huang, Zichen Liang</p>

            <p><strong>Title:</strong><br>
            SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11167v1">http://arxiv.org/abs/2502.11167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 18 Feb 2025 20:48:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92f9c4a9/15eb0391.mp3" length="22944399" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Siqiao Huang, Zichen Liang</p>

            <p><strong>Title:</strong><br>
            SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.11167v1">http://arxiv.org/abs/2502.11167v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, to predict the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. Code and dataset are released at https://github.com/Imbernoulli/SURGE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Region-Adaptive Sampling for Diffusion Transformers</title>
      <itunes:episode>566</itunes:episode>
      <podcast:episode>566</podcast:episode>
      <itunes:title>Region-Adaptive Sampling for Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b66f0761-26ca-4815-a545-f8eb2ee52400</guid>
      <link>https://share.transistor.fm/s/fc03f8a5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Region-Adaptive Sampling for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10389v1">http://arxiv.org/abs/2502.10389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Region-Adaptive Sampling for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10389v1">http://arxiv.org/abs/2502.10389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:44:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fc03f8a5/9586e370.mp3" length="21957557" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang</p>

            <p><strong>Title:</strong><br>
            Region-Adaptive Sampling for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10389v1">http://arxiv.org/abs/2502.10389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Language Diffusion Models</title>
      <itunes:episode>565</itunes:episode>
      <podcast:episode>565</podcast:episode>
      <itunes:title>Large Language Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f6663df3-fd9e-4f8e-8ac7-de727d101d34</guid>
      <link>https://share.transistor.fm/s/9cadc882</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Large Language Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09992v1">http://arxiv.org/abs/2502.09992v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Large Language Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09992v1">http://arxiv.org/abs/2502.09992v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:44:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9cadc882/3eb70ae3.mp3" length="18153690" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1131</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li</p>

            <p><strong>Title:</strong><br>
            Large Language Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09992v1">http://arxiv.org/abs/2502.09992v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</title>
      <itunes:episode>564</itunes:episode>
      <podcast:episode>564</podcast:episode>
      <itunes:title>The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ddbb7c6e-e5e1-462f-aa03-ce74d27e920d</guid>
      <link>https://share.transistor.fm/s/a6677310</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08235v1">http://arxiv.org/abs/2502.08235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning, overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08235v1">http://arxiv.org/abs/2502.08235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning, overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:43:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6677310/6e43beae.mp3" length="26441455" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1649</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez</p>

            <p><strong>Title:</strong><br>
            The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08235v1">http://arxiv.org/abs/2502.08235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning, overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model</title>
      <itunes:episode>563</itunes:episode>
      <podcast:episode>563</podcast:episode>
      <itunes:title>Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5045f0ac-0200-4726-8617-70097671b4fd</guid>
      <link>https://share.transistor.fm/s/ab5debd1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10248v2">http://arxiv.org/abs/2502.10248v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10248v2">http://arxiv.org/abs/2502.10248v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:43:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ab5debd1/28fb46c5.mp3" length="22354662" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1393</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang</p>

            <p><strong>Title:</strong><br>
            Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10248v2">http://arxiv.org/abs/2502.10248v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models</title>
      <itunes:episode>562</itunes:episode>
      <podcast:episode>562</podcast:episode>
      <itunes:title>ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bff3f0c2-ae62-44a4-872b-c3a0dc5fd6fe</guid>
      <link>https://share.transistor.fm/s/b819f112</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09696v1">http://arxiv.org/abs/2502.09696v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09696v1">http://arxiv.org/abs/2502.09696v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:43:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b819f112/e5186f8f.mp3" length="21713918" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09696v1">http://arxiv.org/abs/2502.09696v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MM-RLHF: The Next Step Forward in Multimodal LLM Alignment</title>
      <itunes:episode>561</itunes:episode>
      <podcast:episode>561</podcast:episode>
      <itunes:title>MM-RLHF: The Next Step Forward in Multimodal LLM Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4798933a-848b-45d6-8921-c9ae198f6ade</guid>
      <link>https://share.transistor.fm/s/5d7f01c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            MM-RLHF: The Next Step Forward in Multimodal LLM Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10391v1">http://arxiv.org/abs/2502.10391v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            MM-RLHF: The Next Step Forward in Multimodal LLM Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10391v1">http://arxiv.org/abs/2502.10391v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:42:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5d7f01c1/e4254ff7.mp3" length="22697352" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan</p>

            <p><strong>Title:</strong><br>
            MM-RLHF: The Next Step Forward in Multimodal LLM Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.10391v1">http://arxiv.org/abs/2502.10391v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation</title>
      <itunes:episode>560</itunes:episode>
      <podcast:episode>560</podcast:episode>
      <itunes:title>ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">16d1a4fa-77b2-42af-a75d-21788567efa8</guid>
      <link>https://share.transistor.fm/s/e1a67519</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried</p>

            <p><strong>Title:</strong><br>
            ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09411v1">http://arxiv.org/abs/2502.09411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models, and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried</p>

            <p><strong>Title:</strong><br>
            ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09411v1">http://arxiv.org/abs/2502.09411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models, and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:42:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e1a67519/f2a39d90.mp3" length="22490893" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1402</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Rotem Shalev-Arkushin, Rinon Gal, Amit H. Bermano, Ohad Fried</p>

            <p><strong>Title:</strong><br>
            ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09411v1">http://arxiv.org/abs/2502.09411v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models, and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: https://rotem-shalev.github.io/ImageRAG</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diverse Inference and Verification for Advanced Reasoning</title>
      <itunes:episode>559</itunes:episode>
      <podcast:episode>559</podcast:episode>
      <itunes:title>Diverse Inference and Verification for Advanced Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">040f3dc1-501a-423b-bc28-a022017b74c6</guid>
      <link>https://share.transistor.fm/s/424dee8d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell</p>

            <p><strong>Title:</strong><br>
            Diverse Inference and Verification for Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09955v1">http://arxiv.org/abs/2502.09955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and applying rejection sampling to other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell</p>

            <p><strong>Title:</strong><br>
            Diverse Inference and Verification for Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09955v1">http://arxiv.org/abs/2502.09955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and applying rejection sampling to other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:42:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/424dee8d/e643c75b.mp3" length="22041573" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell</p>

            <p><strong>Title:</strong><br>
            Diverse Inference and Verification for Advanced Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09955v1">http://arxiv.org/abs/2502.09955v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and applying rejection sampling to other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Precise Parameter Localization for Textual Generation in Diffusion Models</title>
      <itunes:episode>558</itunes:episode>
      <podcast:episode>558</podcast:episode>
      <itunes:title>Precise Parameter Localization for Textual Generation in Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">60d11abd-2e74-4af8-bef7-0b69d5d3ca63</guid>
      <link>https://share.transistor.fm/s/12c23602</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic</p>

            <p><strong>Title:</strong><br>
            Precise Parameter Localization for Textual Generation in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09935v1">http://arxiv.org/abs/2502.09935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic</p>

            <p><strong>Title:</strong><br>
            Precise Parameter Localization for Textual Generation in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09935v1">http://arxiv.org/abs/2502.09935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:41:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/12c23602/3f2fc78a.mp3" length="21024276" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1310</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic</p>

            <p><strong>Title:</strong><br>
            Precise Parameter Localization for Textual Generation in Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09935v1">http://arxiv.org/abs/2502.09935v1</a></p>

            <p><strong>Abstract:</strong><br>
            Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that LoRA-based fine-tuning of only the localized layers further enhances the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to large language models like T5). Project page available at https://t2i-text-loc.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DarwinLM: Evolutionary Structured Pruning of Large Language Models</title>
      <itunes:episode>557</itunes:episode>
      <podcast:episode>557</podcast:episode>
      <itunes:title>DarwinLM: Evolutionary Structured Pruning of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4c70635-2f30-4d92-bde9-a964d3bb83b6</guid>
      <link>https://share.transistor.fm/s/09816d7e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            DarwinLM: Evolutionary Structured Pruning of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07780v1">http://arxiv.org/abs/2502.07780v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            DarwinLM: Evolutionary Structured Pruning of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07780v1">http://arxiv.org/abs/2502.07780v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 17 Feb 2025 20:41:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/09816d7e/326deb53.mp3" length="16765683" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1044</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            DarwinLM: Evolutionary Structured Pruning of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07780v1">http://arxiv.org/abs/2502.07780v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU</title>
      <itunes:episode>556</itunes:episode>
      <podcast:episode>556</podcast:episode>
      <itunes:title>InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eaed5c8b-c371-4c18-94f2-99d7d3e774f0</guid>
      <link>https://share.transistor.fm/s/8ebeea36</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08910v1">http://arxiv.org/abs/2502.08910v1</a></p>

            <p><strong>Abstract:</strong><br>
            In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU -- 3x larger -- without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08910v1">http://arxiv.org/abs/2502.08910v1</a></p>

            <p><strong>Abstract:</strong><br>
            In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU -- 3x larger -- without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:38:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8ebeea36/bd410a17.mp3" length="20421590" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1273</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 62 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Heejun Lee, Geon Park, Jaduk Suh, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08910v1">http://arxiv.org/abs/2502.08910v1</a></p>

            <p><strong>Abstract:</strong><br>
            In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU -- 3x larger -- without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding</title>
      <itunes:episode>555</itunes:episode>
      <podcast:episode>555</podcast:episode>
      <itunes:title>The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a2293c74-2086-4047-a1e7-6d054d2ef977</guid>
      <link>https://share.transistor.fm/s/09fb8a99</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08946v1">http://arxiv.org/abs/2502.08946v1</a></p>

            <p><strong>Abstract:</strong><br>
            We systematically investigate a widely asked question: do LLMs really understand what they say? This question relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon, through application examples, to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08946v1">http://arxiv.org/abs/2502.08946v1</a></p>

            <p><strong>Abstract:</strong><br>
            We systematically investigate a widely asked question: do LLMs really understand what they say? This question relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon, through application examples, to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:37:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/09fb8a99/eb59eabc.mp3" length="20842906" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08946v1">http://arxiv.org/abs/2502.08946v1</a></p>

            <p><strong>Abstract:</strong><br>
            We systematically investigate a widely asked question: do LLMs really understand what they say? This question relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the use of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon, through application examples, to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation</title>
      <itunes:episode>554</itunes:episode>
      <podcast:episode>554</podcast:episode>
      <itunes:title>Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8c404634-5f74-414a-9d55-399667a93a50</guid>
      <link>https://share.transistor.fm/s/8beff4bc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08690v1">http://arxiv.org/abs/2502.08690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08690v1">http://arxiv.org/abs/2502.08690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:37:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8beff4bc/b8fafa5f.mp3" length="18779431" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1170</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hoigi Seo, Wongi Jeong, Jae-sun Seo, Se Young Chun</p>

            <p><strong>Title:</strong><br>
            Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08690v1">http://arxiv.org/abs/2502.08690v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models</title>
      <itunes:episode>553</itunes:episode>
      <podcast:episode>553</podcast:episode>
      <itunes:title>SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bf536dde-b22f-4021-85cc-ffcf32c3994b</guid>
      <link>https://share.transistor.fm/s/b7546f39</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih</p>

            <p><strong>Title:</strong><br>
            SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09604v1">http://arxiv.org/abs/2502.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih</p>

            <p><strong>Title:</strong><br>
            SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09604v1">http://arxiv.org/abs/2502.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:36:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b7546f39/a42165b0.mp3" length="21343608" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih</p>

            <p><strong>Title:</strong><br>
            SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09604v1">http://arxiv.org/abs/2502.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 by up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights</title>
      <itunes:episode>552</itunes:episode>
      <podcast:episode>552</podcast:episode>
      <itunes:title>Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">520a8a95-90fd-477c-b25b-a99cdad68fb5</guid>
      <link>https://share.transistor.fm/s/37083b34</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Kahana, Or Nathan, Eliahu Horwitz, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09619v1">http://arxiv.org/abs/2502.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing number of publicly available models, pretrained online models likely already exist for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search over documentation, so users often cannot find the relevant models. This paper presents ProbeLog, a method for retrieving classification models that can recognize a target concept, such as "Dog", without access to model metadata or training data. Unlike previous probing methods, ProbeLog computes a descriptor for each output dimension (logit) of each model by observing its responses on a fixed set of inputs (probes). Our method supports both logit-based retrieval ("find more logits like this") and zero-shot, text-based retrieval ("find all logits corresponding to dogs"). As probing-based representations require multiple costly feedforward passes through the model, we develop a method, based on collaborative filtering, that reduces the cost of encoding repositories by 3x. We demonstrate that ProbeLog achieves high retrieval accuracy in both real-world and fine-grained search tasks, and is scalable to full-size repositories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Kahana, Or Nathan, Eliahu Horwitz, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09619v1">http://arxiv.org/abs/2502.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing number of publicly available models, pretrained online models likely already exist for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search over documentation, so users often cannot find the relevant models. This paper presents ProbeLog, a method for retrieving classification models that can recognize a target concept, such as "Dog", without access to model metadata or training data. Unlike previous probing methods, ProbeLog computes a descriptor for each output dimension (logit) of each model by observing its responses on a fixed set of inputs (probes). Our method supports both logit-based retrieval ("find more logits like this") and zero-shot, text-based retrieval ("find all logits corresponding to dogs"). As probing-based representations require multiple costly feedforward passes through the model, we develop a method, based on collaborative filtering, that reduces the cost of encoding repositories by 3x. We demonstrate that ProbeLog achieves high retrieval accuracy in both real-world and fine-grained search tasks, and is scalable to full-size repositories.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:36:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/37083b34/c638ccc4.mp3" length="19510008" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1216</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jonathan Kahana, Or Nathan, Eliahu Horwitz, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09619v1">http://arxiv.org/abs/2502.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing number of publicly available models, pretrained online models likely already exist for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search over documentation, so users often cannot find the relevant models. This paper presents ProbeLog, a method for retrieving classification models that can recognize a target concept, such as "Dog", without access to model metadata or training data. Unlike previous probing methods, ProbeLog computes a descriptor for each output dimension (logit) of each model by observing its responses on a fixed set of inputs (probes). Our method supports both logit-based retrieval ("find more logits like this") and zero-shot, text-based retrieval ("find all logits corresponding to dogs"). As probing-based representations require multiple costly feedforward passes through the model, we develop a method, based on collaborative filtering, that reduces the cost of encoding repositories by 3x. We demonstrate that ProbeLog achieves high retrieval accuracy in both real-world and fine-grained search tasks, and is scalable to full-size repositories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging</title>
      <itunes:episode>551</itunes:episode>
      <podcast:episode>551</podcast:episode>
      <itunes:title>An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0c0cc575-907c-42ea-998b-14cf15c12449</guid>
      <link>https://share.transistor.fm/s/3a021dc3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai</p>

            <p><strong>Title:</strong><br>
            An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09056v1">http://arxiv.org/abs/2502.09056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a particular focus on the Thai LLM. Our goal is to enhance the reasoning capabilities of language-specific LLMs while maintaining their target language abilities. DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese. However, low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations, which limit performance in these languages. This limitation results in unreliable code-switching and diminished effectiveness on tasks in low-resource languages. Meanwhile, local and regional LLM initiatives have attempted to bridge this gap by developing language-specific LLMs that focus on improving local linguistic fidelity. We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai</p>

            <p><strong>Title:</strong><br>
            An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09056v1">http://arxiv.org/abs/2502.09056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a particular focus on the Thai LLM. Our goal is to enhance the reasoning capabilities of language-specific LLMs while maintaining their target language abilities. DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese. However, low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations, which limit performance in these languages. This limitation results in unreliable code-switching and diminished effectiveness on tasks in low-resource languages. Meanwhile, local and regional LLM initiatives have attempted to bridge this gap by developing language-specific LLMs that focus on improving local linguistic fidelity. We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:36:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3a021dc3/29972ad0.mp3" length="24085851" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1502</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kunat Pipatanakul, Pittawat Taveekitworachai, Potsawee Manakul, Kasima Tharnpipitchai</p>

            <p><strong>Title:</strong><br>
            An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09056v1">http://arxiv.org/abs/2502.09056v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a particular focus on the Thai LLM. Our goal is to enhance the reasoning capabilities of language-specific LLMs while maintaining their target language abilities. DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese. However, low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations, which limit performance in these languages. This limitation results in unreliable code-switching and diminished effectiveness on tasks in low-resource languages. Meanwhile, local and regional LLM initiatives have attempted to bridge this gap by developing language-specific LLMs that focus on improving local linguistic fidelity. We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents</title>
      <itunes:episode>550</itunes:episode>
      <podcast:episode>550</podcast:episode>
      <itunes:title>EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a70dba5-bd21-4fea-839d-ddd5136a8eb3</guid>
      <link>https://share.transistor.fm/s/11f9539a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09560v1">http://arxiv.org/abs/2502.09560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09560v1">http://arxiv.org/abs/2502.09560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:35:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/11f9539a/88212ba7.mp3" length="20283688" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang</p>

            <p><strong>Title:</strong><br>
            EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09560v1">http://arxiv.org/abs/2502.09560v1</a></p>

            <p><strong>Abstract:</strong><br>
            Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploring the Potential of Encoder-free Architectures in 3D LMMs</title>
      <itunes:episode>549</itunes:episode>
      <podcast:episode>549</podcast:episode>
      <itunes:title>Exploring the Potential of Encoder-free Architectures in 3D LMMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2148fca2-c873-45f5-9d31-5e09e4bc7b04</guid>
      <link>https://share.transistor.fm/s/8e15fd12</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Exploring the Potential of Encoder-free Architectures in 3D LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09620v1">http://arxiv.org/abs/2502.09620v1</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Exploring the Potential of Encoder-free Architectures in 3D LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09620v1">http://arxiv.org/abs/2502.09620v1</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:35:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e15fd12/c2a8d3fd.mp3" length="17583627" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1095</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao</p>

            <p><strong>Title:</strong><br>
            Exploring the Potential of Encoder-free Architectures in 3D LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09620v1">http://arxiv.org/abs/2502.09620v1</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses, and present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CoSER: Coordinating LLM-Based Persona Simulation of Established Roles</title>
      <itunes:episode>548</itunes:episode>
      <podcast:episode>548</podcast:episode>
      <itunes:title>CoSER: Coordinating LLM-Based Persona Simulation of Established Roles</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d6153cc9-2e9e-4786-9a06-cdfc51eee078</guid>
      <link>https://share.transistor.fm/s/a2dfd99c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou</p>

            <p><strong>Title:</strong><br>
            CoSER: Coordinating LLM-Based Persona Simulation of Established Roles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09082v1">http://arxiv.org/abs/2502.09082v1</a></p>

            <p><strong>Abstract:</strong><br>
            Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou</p>

            <p><strong>Title:</strong><br>
            CoSER: Coordinating LLM-Based Persona Simulation of Established Roles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09082v1">http://arxiv.org/abs/2502.09082v1</a></p>

            <p><strong>Abstract:</strong><br>
            Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:35:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a2dfd99c/54a6689b.mp3" length="21951305" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou</p>

            <p><strong>Title:</strong><br>
            CoSER: Coordinating LLM-Based Persona Simulation of Established Roles</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.09082v1">http://arxiv.org/abs/2502.09082v1</a></p>

            <p><strong>Abstract:</strong><br>
            Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models</title>
      <itunes:episode>547</itunes:episode>
      <podcast:episode>547</podcast:episode>
      <itunes:title>TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7935798a-eb71-4c8c-9676-cde8eef11398</guid>
      <link>https://share.transistor.fm/s/ad9768b8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao</p>

            <p><strong>Title:</strong><br>
            TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06608v1">http://arxiv.org/abs/2502.06608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao</p>

            <p><strong>Title:</strong><br>
            TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06608v1">http://arxiv.org/abs/2502.06608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 14 Feb 2025 20:34:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ad9768b8/18da7904.mp3" length="22526011" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1404</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao</p>

            <p><strong>Title:</strong><br>
            TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06608v1">http://arxiv.org/abs/2502.06608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance</title>
      <itunes:episode>546</itunes:episode>
      <podcast:episode>546</podcast:episode>
      <itunes:title>Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c6935207-7a2f-4d14-ae55-317b32e515b0</guid>
      <link>https://share.transistor.fm/s/98df9911</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08127v1">http://arxiv.org/abs/2502.08127v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, via CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08127v1">http://arxiv.org/abs/2502.08127v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, via CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:30:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/98df9911/8b2c6ff0.mp3" length="21715992" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Jimin Huang, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08127v1">http://arxiv.org/abs/2502.08127v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) have shown strong general reasoning abilities, yet their effectiveness in financial reasoning remains underexplored. In this study, we comprehensively evaluate 16 powerful reasoning and general LLMs on three complex financial tasks involving financial text, tabular data, and equations, assessing numerical reasoning, tabular interpretation, financial terminology comprehension, long-context processing, and equation-based problem solving. Our results show that while better datasets and pretraining improve financial reasoning, general enhancements like CoT fine-tuning do not always yield consistent gains. Moreover, all reasoning strategies face challenges in improving performance on long-context and multi-table tasks. To address these limitations, we develop a financial reasoning-enhanced model based on Llama-3.1-8B-Instruct, via CoT fine-tuning and reinforcement learning with domain-specific reasoning paths. Even with simple fine-tuning on one financial dataset, our model achieves a consistent 10% performance improvement across tasks, surpassing all 8B models and even Llama3-70B-Instruct and Llama3.1-70B-Instruct on average. Our results highlight the need for domain-specific adaptations in financial tasks, emphasizing future directions such as multi-table reasoning, long-context processing, and financial terminology comprehension. All our datasets, models, and code are publicly available. Furthermore, we introduce a leaderboard for benchmarking future datasets and models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation</title>
      <itunes:episode>545</itunes:episode>
      <podcast:episode>545</podcast:episode>
      <itunes:title>TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e92beb67-9ea8-4fa6-a5b1-3f11d90f4f9c</guid>
      <link>https://share.transistor.fm/s/fbcf276a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li</p>

            <p><strong>Title:</strong><br>
            TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07870v1">http://arxiv.org/abs/2502.07870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-conditioned image generation has gained significant attention in recent years and now processes increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate a 3,000-sample human-improved test set, TextAtlasEval, across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li</p>

            <p><strong>Title:</strong><br>
            TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07870v1">http://arxiv.org/abs/2502.07870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-conditioned image generation has gained significant attention in recent years and now processes increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate a 3,000-sample human-improved test set, TextAtlasEval, across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:30:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fbcf276a/e45b796b.mp3" length="18184236" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1133</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li</p>

            <p><strong>Title:</strong><br>
            TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07870v1">http://arxiv.org/abs/2502.07870v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-conditioned image generation has gained significant attention in recent years and now processes increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million generated and collected long-text images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate a 3,000-sample human-improved test set, TextAtlasEval, across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. This evidence positions TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models</title>
      <itunes:episode>544</itunes:episode>
      <podcast:episode>544</podcast:episode>
      <itunes:title>BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c34703d5-c31e-4f93-ac4f-c2125868251b</guid>
      <link>https://share.transistor.fm/s/c6057562</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan</p>

            <p><strong>Title:</strong><br>
            BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07346v1">http://arxiv.org/abs/2502.07346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan</p>

            <p><strong>Title:</strong><br>
            BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07346v1">http://arxiv.org/abs/2502.07346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:29:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c6057562/5a24ce1c.mp3" length="19147647" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1193</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 35 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan</p>

            <p><strong>Title:</strong><br>
            BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07346v1">http://arxiv.org/abs/2502.07346v1</a></p>

            <p><strong>Abstract:</strong><br>
            Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on. However, measuring these advanced capabilities across languages is underexplored. To address the disparity, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows for fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotate each sample within all tasks after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from dataset construction. Extensive experiments on BenchMAX reveal varying effectiveness of core capabilities across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation</title>
      <itunes:episode>543</itunes:episode>
      <podcast:episode>543</podcast:episode>
      <itunes:title>CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9305e59a-7ad5-477e-9133-545084344cff</guid>
      <link>https://share.transistor.fm/s/523ac7f1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08639v1">http://arxiv.org/abs/2502.08639v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring the generation of the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08639v1">http://arxiv.org/abs/2502.08639v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring the generation of the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:29:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/523ac7f1/d46a32e3.mp3" length="22881702" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1426</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08639v1">http://arxiv.org/abs/2502.08639v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as guidance for a text-to-video diffusion model, ensuring that the generated video matches the user's intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Distillation Scaling Laws</title>
      <itunes:episode>542</itunes:episode>
      <podcast:episode>542</podcast:episode>
      <itunes:title>Distillation Scaling Laws</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e8ac1170-7f38-43d7-bcdb-d8a0ce07dbaa</guid>
      <link>https://share.transistor.fm/s/6a744b25</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb</p>

            <p><strong>Title:</strong><br>
            Distillation Scaling Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08606v1">http://arxiv.org/abs/2502.08606v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb</p>

            <p><strong>Title:</strong><br>
            Distillation Scaling Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08606v1">http://arxiv.org/abs/2502.08606v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:28:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a744b25/295a3b72.mp3" length="20735418" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb</p>

            <p><strong>Title:</strong><br>
            Distillation Scaling Laws</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08606v1">http://arxiv.org/abs/2502.08606v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TransMLA: Multi-Head Latent Attention Is All You Need</title>
      <itunes:episode>541</itunes:episode>
      <podcast:episode>541</podcast:episode>
      <itunes:title>TransMLA: Multi-Head Latent Attention Is All You Need</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e5412702-76ff-415a-871f-987126ae7d49</guid>
      <link>https://share.transistor.fm/s/b107e1e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fanxu Meng, Zengwei Yao, Muhan Zhang</p>

            <p><strong>Title:</strong><br>
            TransMLA: Multi-Head Latent Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07864v2">http://arxiv.org/abs/2502.07864v2</a></p>

            <p><strong>Abstract:</strong><br>
            Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fanxu Meng, Zengwei Yao, Muhan Zhang</p>

            <p><strong>Title:</strong><br>
            TransMLA: Multi-Head Latent Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07864v2">http://arxiv.org/abs/2502.07864v2</a></p>

            <p><strong>Abstract:</strong><br>
            Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:28:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b107e1e4/9dcc2175.mp3" length="19823877" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1235</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Fanxu Meng, Zengwei Yao, Muhan Zhang</p>

            <p><strong>Title:</strong><br>
            TransMLA: Multi-Head Latent Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07864v2">http://arxiv.org/abs/2502.07864v2</a></p>

            <p><strong>Abstract:</strong><br>
            Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation</title>
      <itunes:episode>540</itunes:episode>
      <podcast:episode>540</podcast:episode>
      <itunes:title>WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0ef49d3e-a5ef-4b49-a8e4-854ee94fff6b</guid>
      <link>https://share.transistor.fm/s/65a36252</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08047v1">http://arxiv.org/abs/2502.08047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08047v1">http://arxiv.org/abs/2502.08047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:28:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65a36252/78ecb1b3.mp3" length="19278871" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1201</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.AI, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.08047v1">http://arxiv.org/abs/2502.08047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to sensitivity to the initial state of the environment. Specifically, slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, but existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark spans a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework that leverages a critique mechanism to effectively manage the unpredictability and complexity of GUI interactions. Experimental results demonstrate that GUI-Thinker significantly outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid</title>
      <itunes:episode>539</itunes:episode>
      <podcast:episode>539</podcast:episode>
      <itunes:title>LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">527823f9-85d8-4030-aeaf-f9ae45cd651e</guid>
      <link>https://share.transistor.fm/s/99eab7f3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07563v1">http://arxiv.org/abs/2502.07563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method that enhances both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of: https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07563v1">http://arxiv.org/abs/2502.07563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method that enhances both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of: https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:27:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/99eab7f3/eac17ec6.mp3" length="22484209" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1402</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07563v1">http://arxiv.org/abs/2502.07563v1</a></p>

            <p><strong>Abstract:</strong><br>
            Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method that enhances both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only a single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as part of: https://github.com/OpenSparseLLMs/Linear-MoE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning</title>
      <itunes:episode>538</itunes:episode>
      <podcast:episode>538</podcast:episode>
      <itunes:title>Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f8ae584-e39d-4e45-943c-e302084a5471</guid>
      <link>https://share.transistor.fm/s/b1e5f580</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jean Vassoyan, Nathanaël Beau, Roman Plaud</p>

            <p><strong>Title:</strong><br>
            Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06533v1">http://arxiv.org/abs/2502.06533v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jean Vassoyan, Nathanaël Beau, Roman Plaud</p>

            <p><strong>Title:</strong><br>
            Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06533v1">http://arxiv.org/abs/2502.06533v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 13 Feb 2025 21:27:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b1e5f580/2ee4e3de.mp3" length="20623886" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1285</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jean Vassoyan, Nathanaël Beau, Roman Plaud</p>

            <p><strong>Title:</strong><br>
            Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06533v1">http://arxiv.org/abs/2502.06533v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of "critical tokens" which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Expect the Unexpected: FailSafe Long Context QA for Finance</title>
      <itunes:episode>537</itunes:episode>
      <podcast:episode>537</podcast:episode>
      <itunes:title>Expect the Unexpected: FailSafe Long Context QA for Finance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5cd4962e-812c-4bc3-802f-2d29ba297b8e</guid>
      <link>https://share.transistor.fm/s/1e081e31</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 105 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Expect the Unexpected: FailSafe Long Context QA for Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06329v1">http://arxiv.org/abs/2502.06329v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 105 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Expect the Unexpected: FailSafe Long Context QA for Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06329v1">http://arxiv.org/abs/2502.06329v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:50:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1e081e31/191c4946.mp3" length="20376843" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 105 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh</p>

            <p><strong>Title:</strong><br>
            Expect the Unexpected: FailSafe Long Context QA for Finance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06329v1">http://arxiv.org/abs/2502.06329v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a new long-context financial benchmark, FailSafeQA, designed to test the robustness and context-awareness of LLMs against six variations in human-interface interactions in LLM-based query-answer systems within finance. We concentrate on two case studies: Query Failure and Context Failure. In the Query Failure scenario, we perturb the original query to vary in domain expertise, completeness, and linguistic accuracy. In the Context Failure case, we simulate the uploads of degraded, irrelevant, and empty documents. We employ the LLM-as-a-Judge methodology with Qwen2.5-72B-Instruct and use fine-grained rating criteria to define and calculate Robustness, Context Grounding, and Compliance scores for 24 off-the-shelf models. The results suggest that although some models excel at mitigating input perturbations, they must balance robust answering with the ability to refrain from hallucinating. Notably, Palmyra-Fin-128k-Instruct, recognized as the most compliant model, maintained strong baseline performance but encountered challenges in sustaining robust predictions in 17% of test cases. On the other hand, the most robust model, OpenAI o3-mini, fabricated information in 41% of tested cases. The results demonstrate that even high-performing models have significant room for improvement and highlight the role of FailSafeQA as a tool for developing LLMs optimized for dependability in financial applications. The dataset is available at: https://huggingface.co/datasets/Writer/FailSafeQA</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Competitive Programming with Large Reasoning Models</title>
      <itunes:episode>536</itunes:episode>
      <podcast:episode>536</podcast:episode>
      <itunes:title>Competitive Programming with Large Reasoning Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6a8f08b0-a571-4788-80fe-71803a518753</guid>
      <link>https://share.transistor.fm/s/4089779a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            OpenAI: Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou</p>

            <p><strong>Title:</strong><br>
            Competitive Programming with Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06807v1">http://arxiv.org/abs/2502.06807v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            OpenAI: Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou</p>

            <p><strong>Title:</strong><br>
            Competitive Programming with Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06807v1">http://arxiv.org/abs/2502.06807v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:49:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4089779a/d4d89eaa.mp3" length="20346742" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1268</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            OpenAI: Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, Wenda Zhou</p>

            <p><strong>Title:</strong><br>
            Competitive Programming with Large Reasoning Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06807v1">http://arxiv.org/abs/2502.06807v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models</title>
      <itunes:episode>535</itunes:episode>
      <podcast:episode>535</podcast:episode>
      <itunes:title>Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b4172d26-d460-4917-9c26-ce2a48fbe048</guid>
      <link>https://share.transistor.fm/s/66608a80</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05878v2">http://arxiv.org/abs/2502.05878v2</a></p>

            <p><strong>Abstract:</strong><br>
            Stock movement prediction, a critical task in financial time-series forecasting, relies on identifying and retrieving key influencing factors from vast and complex datasets. However, traditional text-trained or numeric similarity-based retrieval methods often struggle to handle the intricacies of financial data. To address this, we propose the first retrieval-augmented generation (RAG) framework specifically designed for financial time-series forecasting. Our framework incorporates three key innovations: a fine-tuned 1B large language model (StockLLM) as its backbone, a novel candidate selection method enhanced by LLM feedback, and a training objective that maximizes the similarity between queries and historically significant sequences. These advancements enable our retriever, FinSeer, to uncover meaningful patterns while effectively minimizing noise in complex financial datasets. To support robust evaluation, we also construct new datasets that integrate financial indicators and historical stock prices. Experimental results demonstrate that our RAG framework outperforms both the baseline StockLLM and random retrieval methods, showcasing its effectiveness. FinSeer, as the retriever, achieves an 8% higher accuracy on the BIGDATA22 benchmark and retrieves more impactful sequences compared to existing retrieval methods. This work highlights the importance of tailored retrieval models in financial forecasting and provides a novel, scalable framework for future research in the field.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05878v2">http://arxiv.org/abs/2502.05878v2</a></p>

            <p><strong>Abstract:</strong><br>
            Stock movement prediction, a critical task in financial time-series forecasting, relies on identifying and retrieving key influencing factors from vast and complex datasets. However, traditional text-trained or numeric similarity-based retrieval methods often struggle to handle the intricacies of financial data. To address this, we propose the first retrieval-augmented generation (RAG) framework specifically designed for financial time-series forecasting. Our framework incorporates three key innovations: a fine-tuned 1B large language model (StockLLM) as its backbone, a novel candidate selection method enhanced by LLM feedback, and a training objective that maximizes the similarity between queries and historically significant sequences. These advancements enable our retriever, FinSeer, to uncover meaningful patterns while effectively minimizing noise in complex financial datasets. To support robust evaluation, we also construct new datasets that integrate financial indicators and historical stock prices. Experimental results demonstrate that our RAG framework outperforms both the baseline StockLLM and random retrieval methods, showcasing its effectiveness. FinSeer, as the retriever, achieves an 8% higher accuracy on the BIGDATA22 benchmark and retrieves more impactful sequences compared to existing retrieval methods. This work highlights the importance of tailored retrieval models in financial forecasting and provides a novel, scalable framework for future research in the field.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:49:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/66608a80/55370398.mp3" length="21435565" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1336</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie</p>

            <p><strong>Title:</strong><br>
            Enhancing Financial Time-Series Forecasting with Retrieval-Augmented Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05878v2">http://arxiv.org/abs/2502.05878v2</a></p>

            <p><strong>Abstract:</strong><br>
            Stock movement prediction, a critical task in financial time-series forecasting, relies on identifying and retrieving key influencing factors from vast and complex datasets. However, traditional text-trained or numeric similarity-based retrieval methods often struggle to handle the intricacies of financial data. To address this, we propose the first retrieval-augmented generation (RAG) framework specifically designed for financial time-series forecasting. Our framework incorporates three key innovations: a fine-tuned 1B large language model (StockLLM) as its backbone, a novel candidate selection method enhanced by LLM feedback, and a training objective that maximizes the similarity between queries and historically significant sequences. These advancements enable our retriever, FinSeer, to uncover meaningful patterns while effectively minimizing noise in complex financial datasets. To support robust evaluation, we also construct new datasets that integrate financial indicators and historical stock prices. Experimental results demonstrate that our RAG framework outperforms both the baseline StockLLM and random retrieval methods, showcasing its effectiveness. FinSeer, as the retriever, achieves an 8% higher accuracy on the BIGDATA22 benchmark and retrieves more impactful sequences compared to existing retrieval methods. This work highlights the importance of tailored retrieval models in financial forecasting and provides a novel, scalable framework for future research in the field.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction</title>
      <itunes:episode>534</itunes:episode>
      <podcast:episode>534</podcast:episode>
      <itunes:title>CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">17f30241-a93e-4a52-a2c2-9b5e7ffb2f70</guid>
      <link>https://share.transistor.fm/s/75e14417</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He</p>

            <p><strong>Title:</strong><br>
            CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07316v2">http://arxiv.org/abs/2502.07316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses the diverse reasoning patterns inherently embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate that CodeI/O leads to consistent improvements across symbolic, scientific, logic, math &amp; numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He</p>

            <p><strong>Title:</strong><br>
            CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07316v2">http://arxiv.org/abs/2502.07316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses the diverse reasoning patterns inherently embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate that CodeI/O leads to consistent improvements across symbolic, scientific, logic, math &amp; numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:49:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/75e14417/91c9ec37.mp3" length="20885511" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He</p>

            <p><strong>Title:</strong><br>
            CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07316v2">http://arxiv.org/abs/2502.07316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses the diverse reasoning patterns inherently embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate that CodeI/O leads to consistent improvements across symbolic, scientific, logic, math &amp; numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Magic 1-For-1: Generating One Minute Video Clips within One Minute</title>
      <itunes:episode>533</itunes:episode>
      <podcast:episode>533</podcast:episode>
      <itunes:title>Magic 1-For-1: Generating One Minute Video Clips within One Minute</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6d39205c-9815-406c-81ae-309355edb6f3</guid>
      <link>https://share.transistor.fm/s/a17a01d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            Magic 1-For-1: Generating One Minute Video Clips within One Minute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07701v1">http://arxiv.org/abs/2502.07701v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) model convergence speedup using multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second per 1-second video clip on average. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this can serve as a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            Magic 1-For-1: Generating One Minute Video Clips within One Minute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07701v1">http://arxiv.org/abs/2502.07701v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) model convergence speedup using multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second per 1-second video clip on average. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this can serve as a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:48:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a17a01d8/0bd3d3b4.mp3" length="20218444" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou</p>

            <p><strong>Title:</strong><br>
            Magic 1-For-1: Generating One Minute Video Clips within One Minute</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07701v1">http://arxiv.org/abs/2502.07701v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) model convergence speedup using multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost optimization with parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second per 1-second video clip on average. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this can serve as a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!</title>
      <itunes:episode>532</itunes:episode>
      <podcast:episode>532</podcast:episode>
      <itunes:title>LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6b215198-530c-4655-b360-fc466712e16c</guid>
      <link>https://share.transistor.fm/s/36ff3f44</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07374v1">http://arxiv.org/abs/2502.07374v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves accuracy only 3.2% lower than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07374v1">http://arxiv.org/abs/2502.07374v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves accuracy only 3.2% lower than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:48:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/36ff3f44/1fb3c179.mp3" length="24812677" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1547</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica</p>

            <p><strong>Title:</strong><br>
            LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07374v1">http://arxiv.org/abs/2502.07374v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves accuracy only 3.2% lower than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Teaching Language Models to Critique via Reinforcement Learning</title>
      <itunes:episode>531</itunes:episode>
      <podcast:episode>531</podcast:episode>
      <itunes:title>Teaching Language Models to Critique via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aed9c95c-184e-4b91-a03b-40102e35b4ea</guid>
      <link>https://share.transistor.fm/s/0b91f8cd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Teaching Language Models to Critique via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03492v1">http://arxiv.org/abs/2502.03492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Teaching Language Models to Critique via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03492v1">http://arxiv.org/abs/2502.03492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:47:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0b91f8cd/8fe8bf5e.mp3" length="21277131" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong</p>

            <p><strong>Title:</strong><br>
            Teaching Language Models to Critique via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03492v1">http://arxiv.org/abs/2502.03492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1% relative improvements across challenging code generation benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Pre-training to One Hundred Billion Data for Vision Language Models</title>
      <itunes:episode>530</itunes:episode>
      <podcast:episode>530</podcast:episode>
      <itunes:title>Scaling Pre-training to One Hundred Billion Data for Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fae1f93a-3043-452e-9782-dff8af287662</guid>
      <link>https://share.transistor.fm/s/4c060a69</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            Scaling Pre-training to One Hundred Billion Data for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07617v1">http://arxiv.org/abs/2502.07617v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks involving cultural diversity gain more substantially from 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            Scaling Pre-training to One Hundred Billion Data for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07617v1">http://arxiv.org/abs/2502.07617v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks involving cultural diversity gain more substantially from 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:47:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4c060a69/aa1d3089.mp3" length="22036993" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1374</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai</p>

            <p><strong>Title:</strong><br>
            Scaling Pre-training to One Hundred Billion Data for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07617v1">http://arxiv.org/abs/2502.07617v1</a></p>

            <p><strong>Abstract:</strong><br>
            We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks involving cultural diversity gain more substantially from 100-billion-scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset via quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented even in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enhance-A-Video: Better Generated Video for Free</title>
      <itunes:episode>529</itunes:episode>
      <podcast:episode>529</podcast:episode>
      <itunes:title>Enhance-A-Video: Better Generated Video for Free</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b14ef6e2-0d7c-4200-a0e9-26abe62ad31a</guid>
      <link>https://share.transistor.fm/s/116fdc0d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You</p>

            <p><strong>Title:</strong><br>
            Enhance-A-Video: Better Generated Video for Free</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07508v1">http://arxiv.org/abs/2502.07508v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT-based video generation has achieved remarkable results, but enhancing existing models remains a relatively unexplored direction. In this work, we introduce Enhance-A-Video, a training-free approach to enhance the coherence and quality of DiT-based generated videos. The core idea is to strengthen cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You</p>

            <p><strong>Title:</strong><br>
            Enhance-A-Video: Better Generated Video for Free</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07508v1">http://arxiv.org/abs/2502.07508v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT-based video generation has achieved remarkable results, but enhancing existing models remains a relatively unexplored direction. In this work, we introduce Enhance-A-Video, a training-free approach to enhance the coherence and quality of DiT-based generated videos. The core idea is to strengthen cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 12 Feb 2025 20:47:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/116fdc0d/225403da.mp3" length="19746968" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1231</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You</p>

            <p><strong>Title:</strong><br>
            Enhance-A-Video: Better Generated Video for Free</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.07508v1">http://arxiv.org/abs/2502.07508v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT-based video generation has achieved remarkable results, but enhancing existing models remains a relatively unexplored direction. In this work, we introduce Enhance-A-Video, a training-free approach to enhance the coherence and quality of DiT-based generated videos. The core idea is to strengthen cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling</title>
      <itunes:episode>528</itunes:episode>
      <podcast:episode>528</podcast:episode>
      <itunes:title>Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0641af71-272a-4bf9-92ec-a8017ca47fe3</guid>
      <link>https://share.transistor.fm/s/26c88f49</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06703v1">http://arxiv.org/abs/2502.06703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings demonstrate the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06703v1">http://arxiv.org/abs/2502.06703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings demonstrate the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:43:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26c88f49/8e970291.mp3" length="21708475" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06703v1">http://arxiv.org/abs/2502.06703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while achieving higher inference efficiency. These findings demonstrate the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators</title>
      <itunes:episode>527</itunes:episode>
      <podcast:episode>527</podcast:episode>
      <itunes:title>SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b424d860-d18f-4837-a543-a2c667121def</guid>
      <link>https://share.transistor.fm/s/51cfaf53</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko</p>

            <p><strong>Title:</strong><br>
            SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06394v1">http://arxiv.org/abs/2502.06394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in the few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko</p>

            <p><strong>Title:</strong><br>
            SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06394v1">http://arxiv.org/abs/2502.06394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in the few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:43:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/51cfaf53/5c237d37.mp3" length="21184776" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1320</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 71 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko</p>

            <p><strong>Title:</strong><br>
            SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06394v1">http://arxiv.org/abs/2502.06394v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in the few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning</title>
      <itunes:episode>526</itunes:episode>
      <podcast:episode>526</podcast:episode>
      <itunes:title>Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a487e9b7-1a3e-43cb-ae8a-6d774034cc19</guid>
      <link>https://share.transistor.fm/s/2935c6e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06781v1">http://arxiv.org/abs/2502.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties brought by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we additionally apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06781v1">http://arxiv.org/abs/2502.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties brought by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we additionally apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:42:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2935c6e6/27828473.mp3" length="22567381" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06781v1">http://arxiv.org/abs/2502.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain undisclosed, and the only techniques widely believed to be adopted are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through <strong>O</strong>utcome <strong>RE</strong>w<strong>A</strong>rd-based reinforcement <strong>L</strong>earning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulties caused by sparse rewards in RL, which are further exacerbated by the partial correctness of long chains of thought in reasoning tasks, we apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model obtains 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, reaching 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning</title>
      <itunes:episode>525</itunes:episode>
      <podcast:episode>525</podcast:episode>
      <itunes:title>Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fa4b99e0-dcb0-40f3-a83b-7f985afed231</guid>
      <link>https://share.transistor.fm/s/e3eba382</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh</p>

            <p><strong>Title:</strong><br>
            Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06060v1">http://arxiv.org/abs/2502.06060v1</a></p>

            <p><strong>Abstract:</strong><br>
            Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training it to predict information about the environment based on discussions, and we simultaneously improve a model's speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh</p>

            <p><strong>Title:</strong><br>
            Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06060v1">http://arxiv.org/abs/2502.06060v1</a></p>

            <p><strong>Abstract:</strong><br>
            Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training it to predict information about the environment based on discussions, and we simultaneously improve a model's speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/</p>
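
            <p><strong>Code sketch:</strong> The dense rewards can be pictured directly: a listening reward scores how much probability an agent assigns to the true impostor, and a speaking reward scores a message by how much it shifts the other agents' beliefs. The snippet below is a toy illustration with hard-coded belief numbers, not the paper's RL training loop.</p>

            <pre><code>
# Illustrative sketch, not the paper's implementation: toy versions of the dense
# "listening" and "speaking" rewards. Belief distributions are hard-coded numbers
# standing in for model outputs.

def listening_reward(belief, true_impostor):
    """Probability the listener assigns to the true impostor."""
    return belief[true_impostor]

def speaking_reward(beliefs_before, beliefs_after, true_impostor):
    """Average shift in the listeners' probability on the true impostor after a message."""
    gains = [after[true_impostor] - before[true_impostor]
             for before, after in zip(beliefs_before, beliefs_after)]
    return sum(gains) / len(gains)

# Three listeners' beliefs over players A/B/C before and after hearing
# "I saw B venting" (toy numbers; B is the real impostor).
before = [{"A": 0.3, "B": 0.3, "C": 0.4}] * 3
after = [{"A": 0.2, "B": 0.6, "C": 0.2}] * 3

print(listening_reward(after[0], "B"))        # 0.6
print(speaking_reward(before, after, "B"))    # ~0.3
</code></pre>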
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:42:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e3eba382/5fcf86f2.mp3" length="21418841" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.AI, cs.CL, cs.LG, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Bidipta Sarkar, Warren Xia, C. Karen Liu, Dorsa Sadigh</p>

            <p><strong>Title:</strong><br>
            Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06060v1">http://arxiv.org/abs/2502.06060v1</a></p>

            <p><strong>Abstract:</strong><br>
            Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training it to predict information about the environment based on discussions, and we simultaneously improve a model's speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging</title>
      <itunes:episode>524</itunes:episode>
      <podcast:episode>524</podcast:episode>
      <itunes:title>CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b1e3d403-3b59-45b5-bcd2-0c752f29ebdb</guid>
      <link>https://share.transistor.fm/s/398f1217</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05664v1">http://arxiv.org/abs/2502.05664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at https://kagnlp.github.io/codesim.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05664v1">http://arxiv.org/abs/2502.05664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at https://kagnlp.github.io/codesim.github.io/.</p>
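
            <p><strong>Code sketch:</strong> The verify-before-accepting idea can be reduced to checking a candidate program against the problem's sample input/output pairs. In CodeSim the simulation is carried out by the LLM step by step; in the toy snippet below, hand-written candidates and plain execution stand in for generated code and simulated reasoning.</p>

            <pre><code>
# Illustrative sketch, not the CodeSim implementation: accept a candidate program
# only if it reproduces the sample input/output pairs. Hand-written candidates
# stand in for LLM-generated code.

def simulate(candidate_src, samples):
    """Run the candidate on each sample input and compare with the expected output."""
    scope = {}
    exec(candidate_src, scope)               # defines solve(x) in `scope`
    solve = scope["solve"]
    return all(solve(x) == y for x, y in samples)

samples = [(2, 4), (3, 9), (5, 25)]          # toy spec: square the input

candidates = [
    "def solve(x):\n    return x * 2",       # wrong plan, rejected by simulation
    "def solve(x):\n    return x * x",       # passes all samples, accepted
]

for src in candidates:
    print(simulate(src, samples), "<-", src.splitlines()[1].strip())
</code></pre>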
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:41:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/398f1217/791a931a.mp3" length="19897072" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1240</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05664v1">http://arxiv.org/abs/2502.05664v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at https://kagnlp.github.io/codesim.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LM2: Large Memory Models</title>
      <itunes:episode>523</itunes:episode>
      <podcast:episode>523</podcast:episode>
      <itunes:title>LM2: Large Memory Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4dbad579-0f2e-40b6-ac6b-5b562ea37d2e</guid>
      <link>https://share.transistor.fm/s/f81c2dc3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis</p>

            <p><strong>Title:</strong><br>
            LM2: Large Memory Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06049v1">http://arxiv.org/abs/2502.06049v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question-answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore memory interpretability, the effectiveness of memory modules, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis</p>

            <p><strong>Title:</strong><br>
            LM2: Large Memory Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06049v1">http://arxiv.org/abs/2502.06049v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question-answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore memory interpretability, the effectiveness of memory modules, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.</p>
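
            <p><strong>Code sketch:</strong> The architectural idea -- keep the ordinary attention pathway and add a memory bank that is read through cross-attention and merged through a gate -- can be sketched in a few lines. The sizes, the gating form, and the absence of memory writes are simplifications; this is not the LM2 configuration.</p>

            <pre><code>
# Illustrative sketch, not the LM2 architecture: a Transformer block with a learned
# memory bank read via cross-attention and merged through a gate, while the original
# residual stream is preserved.
import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_slots=16):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.read_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x):                          # x: (batch, seq, d_model)
        mem = self.memory.expand(x.size(0), -1, -1)
        h, _ = self.self_attn(x, x, x)             # ordinary information flow
        x = x + h
        read, _ = self.read_attn(x, mem, mem)      # read memory via cross-attention
        g = torch.sigmoid(self.gate(torch.cat([x, read], dim=-1)))
        return x + g * read                        # gated memory pathway on top of the stream

block = MemoryAugmentedBlock()
print(block(torch.randn(2, 10, 64)).shape)         # torch.Size([2, 10, 64])
</code></pre>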
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:41:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f81c2dc3/f0a1f91a.mp3" length="24561833" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1531</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis</p>

            <p><strong>Title:</strong><br>
            LM2: Large Memory Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06049v1">http://arxiv.org/abs/2502.06049v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational argumentation, and synthesizing information distributed over long contexts. The proposed LM2 incorporates a memory module that acts as a contextual representation repository, interacting with input tokens via cross attention and updating through gating mechanisms. To preserve the Transformer's general-purpose capabilities, LM2 maintains the original information flow while integrating a complementary memory pathway. Experimental results on the BABILong benchmark demonstrate that the LM2 model outperforms both the memory-augmented RMT model by 37.1% and the baseline Llama-3.2 model by 86.3% on average across tasks. LM2 exhibits exceptional capabilities in multi-hop inference, numerical reasoning, and large-context question-answering. On the MMLU dataset, it achieves a 5.0% improvement over a pre-trained vanilla model, demonstrating that its memory module does not degrade performance on general tasks. Further, in our analysis, we explore memory interpretability, the effectiveness of memory modules, and test-time behavior. Our findings emphasize the importance of explicit memory in enhancing Transformer architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Matryoshka Quantization</title>
      <itunes:episode>522</itunes:episode>
      <podcast:episode>522</podcast:episode>
      <itunes:title>Matryoshka Quantization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5b171d03-9cb0-4028-9218-ba8cbc77460f</guid>
      <link>https://share.transistor.fm/s/acbd8421</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati</p>

            <p><strong>Title:</strong><br>
            Matryoshka Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06786v1">http://arxiv.org/abs/2502.06786v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati</p>

            <p><strong>Title:</strong><br>
            Matryoshka Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06786v1">http://arxiv.org/abs/2502.06786v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.</p>
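
            <p><strong>Code sketch:</strong> The nesting itself is easy to see: once weights are quantized to int8, an int4 or int2 version is just the most significant bits of the same codes, so a single set of weights can be sliced to several precisions. The snippet below uses plain symmetric rounding and only illustrates the nested structure, not the paper's co-training and co-distillation recipe.</p>

            <pre><code>
# Illustrative sketch, not the MatQuant recipe: the nested (Matryoshka) structure of
# integer types. Slicing the most significant bits of an int8 code yields the int4
# and int2 versions of the same weights; lower precision simply costs more error.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)                     # toy weight tensor

scale = np.abs(w).max() / 127.0
q8 = np.clip(np.round(w / scale), -128, 127).astype(np.int32)    # int8 codes

def slice_msb(codes, bits):
    """Keep only the top `bits` bits of the int8 code (the nested lower precision)."""
    shift = 8 - bits
    return (codes >> shift) << shift                             # arithmetic shift keeps the sign

for bits in (8, 4, 2):
    deq = slice_msb(q8, bits).astype(np.float32) * scale
    print(f"int{bits}: mse={np.mean((w - deq) ** 2):.5f}")
</code></pre>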
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:41:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/acbd8421/42e5d559.mp3" length="21791181" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati</p>

            <p><strong>Title:</strong><br>
            Matryoshka Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06786v1">http://arxiv.org/abs/2502.06786v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation</title>
      <itunes:episode>521</itunes:episode>
      <podcast:episode>521</podcast:episode>
      <itunes:title>Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03a67b74-4821-4d5c-8f25-966e4ffb0ef7</guid>
      <link>https://share.transistor.fm/s/95c76005</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05415v1">http://arxiv.org/abs/2502.05415v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), an established approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05415v1">http://arxiv.org/abs/2502.05415v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), an established approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.</p>
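
            <p><strong>Code sketch:</strong> The trajectory-segmentation idea can be pictured on a 1-D toy problem: roll out a many-step teacher "denoiser", cut the trajectory into segments, and train a small student to jump from any point inside a segment straight to that segment's endpoint. The teacher map, segment count, and loss below are invented for illustration; the actual consistency-distillation objective for Show-o's multimodal trajectories is more involved.</p>

            <pre><code>
# Illustrative sketch, not the Show-o Turbo method: segment-wise distillation on a
# 1-D toy "denoising" trajectory. The student learns to map any point inside a
# segment directly to the segment's endpoint, shortening the sampling process.
import torch
import torch.nn as nn

def teacher_step(x):                                  # one small denoising step toward 0
    return 0.9 * x

steps, n_segments, batch = 32, 4, 64
seg_len = steps // n_segments
student = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-2)

for it in range(300):
    x = torch.randn(batch, 1) * 3.0                   # noisy starting samples
    traj = [x]
    for _ in range(steps):                            # roll out the many-step teacher
        traj.append(teacher_step(traj[-1]))
    traj = torch.stack(traj)                          # (steps + 1, batch, 1)
    t = torch.randint(0, steps, (batch,))             # a random point on each trajectory
    seg_end = (t // seg_len + 1) * seg_len            # index of that segment's endpoint
    rows = torch.arange(batch)
    x_t, target = traj[t, rows], traj[seg_end, rows]
    pred = student(torch.cat([x_t, t.float().unsqueeze(1) / steps], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final distillation loss: {loss.item():.4f}")
</code></pre>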
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:40:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95c76005/9d036500.mp3" length="18220614" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1135</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng</p>

            <p><strong>Title:</strong><br>
            Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05415v1">http://arxiv.org/abs/2502.05415v1</a></p>

            <p><strong>Abstract:</strong><br>
            There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), an established approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding</title>
      <itunes:episode>520</itunes:episode>
      <podcast:episode>520</podcast:episode>
      <itunes:title>Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a76633c-4884-4f23-bcd6-a1391f216f8a</guid>
      <link>https://share.transistor.fm/s/4c1a7078</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon</p>

            <p><strong>Title:</strong><br>
            Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05609v1">http://arxiv.org/abs/2502.05609v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon</p>

            <p><strong>Title:</strong><br>
            Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05609v1">http://arxiv.org/abs/2502.05609v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.</p>
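
            <p><strong>Code sketch:</strong> The drafting hierarchy can be mimicked with three bigram lookup tables -- the current context, recent generations, and a static corpus -- queried from highest to lowest temporal locality; in real speculative decoding the drafted tokens would then be verified by the target model in one forward pass. The tables and greedy lookup below are toy stand-ins, not the paper's HD implementation.</p>

            <pre><code>
# Illustrative sketch, not the paper's implementation: draft tokens are proposed from
# the highest-locality database that has a match (context > recent history > corpus).
# Toy bigram tables stand in for the real draft databases.
from collections import defaultdict

def build_bigrams(tokens):
    db = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        db[a].append(b)
    return db

context = "the cat sat on the mat".split()          # highest temporal locality
history = "the dog sat on the rug".split()          # recent generations
corpus = "on the table near the door".split()       # lowest temporal locality
databases = [build_bigrams(t) for t in (context, history, corpus)]

def draft(prev_token, max_len=4):
    """Greedily propose up to max_len draft tokens, preferring high-locality databases."""
    out, tok = [], prev_token
    for _ in range(max_len):
        nxt = next((db[tok][0] for db in databases if db[tok]), None)
        if nxt is None:
            break
        out.append(nxt)
        tok = nxt
    return out

print(draft("the"))   # ['cat', 'sat', 'on', 'the'] -- all drafted from the context table
</code></pre>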
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:40:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4c1a7078/01e9ba91.mp3" length="20298750" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1265</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon</p>

            <p><strong>Title:</strong><br>
            Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05609v1">http://arxiv.org/abs/2502.05609v1</a></p>

            <p><strong>Abstract:</strong><br>
            Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates</title>
      <itunes:episode>519</itunes:episode>
      <podcast:episode>519</podcast:episode>
      <itunes:title>ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">784be6e9-f14b-4938-a943-6656ee3caa0e</guid>
      <link>https://share.transistor.fm/s/c59b4dbf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06772v1">http://arxiv.org/abs/2502.06772v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06772v1">http://arxiv.org/abs/2502.06772v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux</p>
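
            <p><strong>Code sketch:</strong> The planner's outer loop -- retrieve relevant high-level templates and chain them into a trajectory -- can be caricatured with keyword matching over a tiny hand-made library. ReasonFlux uses roughly 500 templates and a hierarchically RL-trained planner; everything below is invented for illustration.</p>

            <pre><code>
# Illustrative sketch, not the ReasonFlux planner: rank high-level thought templates
# by keyword overlap with the problem and chain the best ones into a trajectory.
TEMPLATES = {
    "substitution":     {"keywords": {"equation", "variable", "solve"},
                         "steps": ["isolate one variable", "substitute", "solve remaining equation"]},
    "casework":         {"keywords": {"integer", "cases", "parity"},
                         "steps": ["split into cases", "solve each case", "combine results"]},
    "geometric series": {"keywords": {"sum", "ratio", "series"},
                         "steps": ["identify ratio", "apply sum formula", "simplify"]},
}

def plan(problem, n_templates=2):
    """Rank templates by keyword overlap with the problem and chain the best ones."""
    words = set(problem.lower().split())
    ranked = sorted(TEMPLATES, key=lambda name: -len(TEMPLATES[name]["keywords"] & words))
    return [(name, TEMPLATES[name]["steps"]) for name in ranked[:n_templates]]

for name, steps in plan("Solve the equation system in two variables with integer cases"):
    print(name, "->", steps)
</code></pre>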
            ]]>
      </content:encoded>
      <pubDate>Tue, 11 Feb 2025 20:39:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c59b4dbf/536e8474.mp3" length="20619686" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1285</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang</p>

            <p><strong>Title:</strong><br>
            ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.06772v1">http://arxiv.org/abs/2502.06772v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoRoPE: What Makes for Good Video Rotary Position Embedding?</title>
      <itunes:episode>518</itunes:episode>
      <podcast:episode>518</podcast:episode>
      <itunes:title>VideoRoPE: What Makes for Good Video Rotary Position Embedding?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a9237165-69f4-4b7f-b079-b220413ba49f</guid>
      <link>https://share.transistor.fm/s/e4b7b1da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            VideoRoPE: What Makes for Good Video Rotary Position Embedding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05173v1">http://arxiv.org/abs/2502.05173v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce <strong>VideoRoPE</strong>, with a <em>3D structure</em> designed to preserve spatio-temporal relationships. VideoRoPE features <em>low-frequency temporal allocation</em> to mitigate periodic oscillations, a <em>diagonal layout</em> to maintain spatial symmetry, and <em>adjustable temporal spacing</em> to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at <a href="https://github.com/Wiselnn570/VideoRoPE">https://github.com/Wiselnn570/VideoRoPE</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            VideoRoPE: What Makes for Good Video Rotary Position Embedding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05173v1">http://arxiv.org/abs/2502.05173v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce <strong>VideoRoPE</strong>, with a <em>3D structure</em> designed to preserve spatio-temporal relationships. VideoRoPE features <em>low-frequency temporal allocation</em> to mitigate periodic oscillations, a <em>diagonal layout</em> to maintain spatial symmetry, and <em>adjustable temporal spacing</em> to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at <a href="https://github.com/Wiselnn570/VideoRoPE">https://github.com/Wiselnn570/VideoRoPE</a>.</p>
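
            <p><strong>Code sketch:</strong> The basic move -- index each video token by (t, x, y) instead of a single position, give the temporal axis the low-frequency rotary bands, and scale it by an adjustable spacing -- is easy to write down. The dimension split, frequencies, and spacing below are illustrative choices, not the paper's exact allocation (and the diagonal layout is omitted).</p>

            <pre><code>
# Illustrative sketch, not the VideoRoPE allocation: rotary angles built from a 3D
# (t, x, y) index, with the temporal axis assigned the low-frequency bands and scaled
# by an adjustable spacing delta.
import numpy as np

def video_rope_angles(t, x, y, dim=16, delta=2.0, base=10000.0):
    """Return dim/2 rotation angles for one token at grid position (t, x, y)."""
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # high -> low frequency
    n_t = half // 2                                 # share of bands given to time
    angles = np.empty(half)
    angles[:n_t] = delta * t * freqs[half - n_t:]   # temporal axis gets the LOW frequencies
    spatial = freqs[:half - n_t]
    angles[n_t::2] = x * spatial[::2]               # interleave x and y on the rest
    angles[n_t + 1::2] = y * spatial[1::2]
    return angles

print(video_rope_angles(t=3, x=5, y=7).round(3))
</code></pre>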
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:16:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e4b7b1da/681fbcd1.mp3" length="19667988" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1226</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            VideoRoPE: What Makes for Good Video Rotary Position Embedding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05173v1">http://arxiv.org/abs/2502.05173v1</a></p>

            <p><strong>Abstract:</strong><br>
            While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce <strong>VideoRoPE</strong>, with a <em>3D structure</em> designed to preserve spatio-temporal relationships. VideoRoPE features <em>low-frequency temporal allocation</em> to mitigate periodic oscillations, a <em>diagonal layout</em> to maintain spatial symmetry, and <em>adjustable temporal spacing</em> to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at <a href="https://github.com/Wiselnn570/VideoRoPE">https://github.com/Wiselnn570/VideoRoPE</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fast Video Generation with Sliding Tile Attention</title>
      <itunes:episode>517</itunes:episode>
      <podcast:episode>517</podcast:episode>
      <itunes:title>Fast Video Generation with Sliding Tile Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">65d3c8a0-14b0-4143-b4fb-a958854d5c7b</guid>
      <link>https://share.transistor.fm/s/dd71febd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast Video Generation with Sliding Tile Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04507v1">http://arxiv.org/abs/2502.04507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast Video Generation with Sliding Tile Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04507v1">http://arxiv.org/abs/2502.04507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:16:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dd71febd/059e7ec4.mp3" length="20516849" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1279</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Fast Video Generation with Sliding Tile Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04507v1">http://arxiv.org/abs/2502.04507v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Goku: Flow Based Video Generative Foundation Models</title>
      <itunes:episode>516</itunes:episode>
      <podcast:episode>516</podcast:episode>
      <itunes:title>Goku: Flow Based Video Generative Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">54039eb3-1f98-4154-b994-215b8a1eafff</guid>
      <link>https://share.transistor.fm/s/6b67988c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu</p>

            <p><strong>Title:</strong><br>
            Goku: Flow Based Video Generative Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04896v2">http://arxiv.org/abs/2502.04896v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu</p>

            <p><strong>Title:</strong><br>
            Goku: Flow Based Video Generative Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04896v2">http://arxiv.org/abs/2502.04896v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:16:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b67988c/0d445011.mp3" length="21476486" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1339</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu</p>

            <p><strong>Title:</strong><br>
            Goku: Flow Based Video Generative Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04896v2">http://arxiv.org/abs/2502.04896v2</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QuEST: Stable Training of LLMs with 1-Bit Weights and Activations</title>
      <itunes:episode>515</itunes:episode>
      <podcast:episode>515</podcast:episode>
      <itunes:title>QuEST: Stable Training of LLMs with 1-Bit Weights and Activations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0fdfea10-5bc9-4b65-bc91-2b79bf128ed3</guid>
      <link>https://share.transistor.fm/s/2099c38e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            QuEST: Stable Training of LLMs with 1-Bit Weights and Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05003v1">http://arxiv.org/abs/2502.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations.   We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            QuEST: Stable Training of LLMs with 1-Bit Weights and Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05003v1">http://arxiv.org/abs/2502.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations.   We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:15:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2099c38e/e760bd25.mp3" length="21941688" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            QuEST: Stable Training of LLMs with 1-Bit Weights and Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05003v1">http://arxiv.org/abs/2502.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations.   We advance this state-of-the-art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4-bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</title>
      <itunes:episode>514</itunes:episode>
      <podcast:episode>514</podcast:episode>
      <itunes:title>Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8416c24c-0e07-4ead-b2e5-dc447a1aa8d7</guid>
      <link>https://share.transistor.fm/s/6fa4df32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein</p>

            <p><strong>Title:</strong><br>
            Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05171v1">http://arxiv.org/abs/2502.05171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein</p>

            <p><strong>Title:</strong><br>
            Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05171v1">http://arxiv.org/abs/2502.05171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:15:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6fa4df32/f23a4ecb.mp3" length="22263948" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1388</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein</p>

            <p><strong>Title:</strong><br>
            Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05171v1">http://arxiv.org/abs/2502.05171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting</title>
      <itunes:episode>513</itunes:episode>
      <podcast:episode>513</podcast:episode>
      <itunes:title>AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">941de49a-cf69-4534-aa77-16cd7197718b</guid>
      <link>https://share.transistor.fm/s/8cbd1c35</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05176v1">http://arxiv.org/abs/2502.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes. See our project page for video results and the dataset at https://kkennethwu.github.io/aurafusion360/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05176v1">http://arxiv.org/abs/2502.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes. See our project page for video results and the dataset at https://kkennethwu.github.io/aurafusion360/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:15:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8cbd1c35/5e79a496.mp3" length="17252639" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1075</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu</p>

            <p><strong>Title:</strong><br>
            AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05176v1">http://arxiv.org/abs/2502.05176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes. See our project page for video results and the dataset at https://kkennethwu.github.io/aurafusion360/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails</title>
      <itunes:episode>512</itunes:episode>
      <podcast:episode>512</podcast:episode>
      <itunes:title>DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4fa027d6-8ef1-470c-9395-7aea16a425e0</guid>
      <link>https://share.transistor.fm/s/3c9c7309</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li</p>

            <p><strong>Title:</strong><br>
            DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05163v1">http://arxiv.org/abs/2502.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li</p>

            <p><strong>Title:</strong><br>
            DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05163v1">http://arxiv.org/abs/2502.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:14:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3c9c7309/f498dfc9.mp3" length="18499803" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1153</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li</p>

            <p><strong>Title:</strong><br>
            DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05163v1">http://arxiv.org/abs/2502.05163v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agency Is Frame-Dependent</title>
      <itunes:episode>511</itunes:episode>
      <podcast:episode>511</podcast:episode>
      <itunes:title>Agency Is Frame-Dependent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03fd19b1-e754-4901-8dc6-eae9d07abd61</guid>
      <link>https://share.transistor.fm/s/1214d179</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Abel, André Barreto, Michael Bowling, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh</p>

            <p><strong>Title:</strong><br>
            Agency Is Frame-Dependent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04403v1">http://arxiv.org/abs/2502.04403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Abel, André Barreto, Michael Bowling, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh</p>

            <p><strong>Title:</strong><br>
            Agency Is Frame-Dependent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04403v1">http://arxiv.org/abs/2502.04403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:14:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1214d179/5c692abf.mp3" length="22365041" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1394</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            David Abel, André Barreto, Michael Bowling, Will Dabney, Shi Dong, Steven Hansen, Anna Harutyunyan, Khimya Khetarpal, Clare Lyle, Razvan Pascanu, Georgios Piliouras, Doina Precup, Jonathan Richens, Mark Rowland, Tom Schaul, Satinder Singh</p>

            <p><strong>Title:</strong><br>
            Agency Is Frame-Dependent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04403v1">http://arxiv.org/abs/2502.04403v1</a></p>

            <p><strong>Abstract:</strong><br>
            Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science, and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation</title>
      <itunes:episode>510</itunes:episode>
      <podcast:episode>510</podcast:episode>
      <itunes:title>FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0136b8b8-e157-4cf2-b20f-bb83cc099eb9</guid>
      <link>https://share.transistor.fm/s/bf18445a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo</p>

            <p><strong>Title:</strong><br>
            FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05179v1">http://arxiv.org/abs/2502.05179v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo</p>

            <p><strong>Title:</strong><br>
            FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05179v1">http://arxiv.org/abs/2502.05179v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:14:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bf18445a/15442267.mp3" length="22347546" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1393</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo</p>

            <p><strong>Title:</strong><br>
            FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.05179v1">http://arxiv.org/abs/2502.05179v1</a></p>

            <p><strong>Abstract:</strong><br>
            DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generating Symbolic World Models via Test-time Scaling of Large Language Models</title>
      <itunes:episode>509</itunes:episode>
      <podcast:episode>509</podcast:episode>
      <itunes:title>Generating Symbolic World Models via Test-time Scaling of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">18774710-01de-4a64-adbc-424e190615f2</guid>
      <link>https://share.transistor.fm/s/9375b6b6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Generating Symbolic World Models via Test-time Scaling of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04728v1">http://arxiv.org/abs/2502.04728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Generating Symbolic World Models via Test-time Scaling of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04728v1">http://arxiv.org/abs/2502.04728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 10 Feb 2025 21:13:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9375b6b6/4fcff365.mp3" length="20444573" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1274</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu</p>

            <p><strong>Title:</strong><br>
            Generating Symbolic World Models via Test-time Scaling of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04728v1">http://arxiv.org/abs/2502.04728v1</a></p>

            <p><strong>Abstract:</strong><br>
            Solving complex planning problems requires Large Language Models (LLMs) to explicitly model state transitions to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, the Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic search algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs Best-of-N sampling to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in PDDL domain generation, achieving over a 50% success rate on two tasks (i.e., generating PDDL domains from natural language descriptions or from PDDL problems), without requiring any additional training. By taking advantage of PDDL as a state abstraction, our method outperforms current state-of-the-art methods on almost all competition-level planning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Analyze Feature Flow to Enhance Interpretation and Steering in Language Models</title>
      <itunes:episode>508</itunes:episode>
      <podcast:episode>508</podcast:episode>
      <itunes:title>Analyze Feature Flow to Enhance Interpretation and Steering in Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b1281cba-f96f-43d6-baba-137acfdb5bed</guid>
      <link>https://share.transistor.fm/s/90a85d2c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Analyze Feature Flow to Enhance Interpretation and Steering in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03032v2">http://arxiv.org/abs/2502.03032v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Analyze Feature Flow to Enhance Interpretation and Steering in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03032v2">http://arxiv.org/abs/2502.03032v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:53:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/90a85d2c/e4884126.mp3" length="22647634" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            Analyze Feature Flow to Enhance Interpretation and Steering in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03032v2">http://arxiv.org/abs/2502.03032v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UltraIF: Advancing Instruction Following from the Wild</title>
      <itunes:episode>507</itunes:episode>
      <podcast:episode>507</podcast:episode>
      <itunes:title>UltraIF: Advancing Instruction Following from the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">427f34ed-9b78-4aad-a881-4d450631ae85</guid>
      <link>https://share.transistor.fm/s/cc9fc7ba</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            UltraIF: Advancing Instruction Following from the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04153v1">http://arxiv.org/abs/2502.04153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, as there are large gaps between models trained by the open-source community and those trained by leading companies. To bridge the gap, we propose UltraIF, a simple and scalable approach for building LLMs that can follow complex instructions using open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses using the evaluation questions. In our experiments, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only an 8B model as the response generator and evaluator. The aligned model also achieves competitive scores on other benchmarks. Moreover, we show that UltraIF can further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            UltraIF: Advancing Instruction Following from the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04153v1">http://arxiv.org/abs/2502.04153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, as there are large gaps between models trained by the open-source community and those trained by leading companies. To bridge the gap, we propose UltraIF, a simple and scalable approach for building LLMs that can follow complex instructions using open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses using the evaluation questions. In our experiments, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only an 8B model as the response generator and evaluator. The aligned model also achieves competitive scores on other benchmarks. Moreover, we show that UltraIF can further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:52:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cc9fc7ba/a50ca708.mp3" length="19112930" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1191</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            UltraIF: Advancing Instruction Following from the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04153v1">http://arxiv.org/abs/2502.04153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction following has made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, as there are large gaps between models trained by the open-source community and those trained by leading companies. To bridge the gap, we propose UltraIF, a simple and scalable approach for building LLMs that can follow complex instructions using open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses using the evaluation questions. In our experiments, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only an 8B model as the response generator and evaluator. The aligned model also achieves competitive scores on other benchmarks. Moreover, we show that UltraIF can further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Great Models Think Alike and this Undermines AI Oversight</title>
      <itunes:episode>506</itunes:episode>
      <podcast:episode>506</podcast:episode>
      <itunes:title>Great Models Think Alike and this Undermines AI Oversight</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a05b235c-db33-41d5-946b-83a573cf87b9</guid>
      <link>https://share.transistor.fm/s/df0a5a44</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            Great Models Think Alike and this Undermines AI Oversight</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04313v1">http://arxiv.org/abs/2502.04313v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both of these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations and find that complementary knowledge between the weak supervisor and the strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            Great Models Think Alike and this Undermines AI Oversight</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04313v1">http://arxiv.org/abs/2502.04313v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both of these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations and find that complementary knowledge between the weak supervisor and the strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:52:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/df0a5a44/dd8b3687.mp3" length="29248025" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1824</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping</p>

            <p><strong>Title:</strong><br>
            Great Models Think Alike and this Undermines AI Oversight</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04313v1">http://arxiv.org/abs/2502.04313v1</a></p>

            <p><strong>Abstract:</strong><br>
            As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both of these tasks, which we refer to as "AI Oversight". We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations and find that complementary knowledge between the weak supervisor and the strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend: model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</title>
      <itunes:episode>505</itunes:episode>
      <podcast:episode>505</podcast:episode>
      <itunes:title>Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2e62018e-3387-4fbb-a36d-70a6ccdc244c</guid>
      <link>https://share.transistor.fm/s/150ade85</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong</p>

            <p><strong>Title:</strong><br>
            Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03544v1">http://arxiv.org/abs/2502.03544v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiad (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of the Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for <em>all</em> geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved the silver-medal standard at IMO 2024 (https://dpmd.ai/imo-silver). Last but not least, we report progress towards using AlphaGeometry2 as part of a fully automated system that reliably solves geometry problems directly from natural language input.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong</p>

            <p><strong>Title:</strong><br>
            Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03544v1">http://arxiv.org/abs/2502.03544v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiad (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of the Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for <em>all</em> geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved the silver-medal standard at IMO 2024 (https://dpmd.ai/imo-silver). Last but not least, we report progress towards using AlphaGeometry2 as part of a fully automated system that reliably solves geometry problems directly from natural language input.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:51:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/150ade85/15a06be9.mp3" length="21703879" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Vikas Verma, Quoc V. Le, Thang Luong</p>

            <p><strong>Title:</strong><br>
            Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03544v1">http://arxiv.org/abs/2502.03544v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present AlphaGeometry2, a significantly improved version of AlphaGeometry introduced in Trinh et al. (2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle harder problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with other additions, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiad (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AlphaGeometry2 has also been greatly improved through the use of the Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that combines multiple search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AlphaGeometry2 to 84% for <em>all</em> geometry problems over the last 25 years, compared to 54% previously. AlphaGeometry2 was also part of the system that achieved the silver-medal standard at IMO 2024 (https://dpmd.ai/imo-silver). Last but not least, we report progress towards using AlphaGeometry2 as part of a fully automated system that reliably solves geometry problems directly from natural language input.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment</title>
      <itunes:episode>504</itunes:episode>
      <podcast:episode>504</podcast:episode>
      <itunes:title>Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4af97a39-5cf5-40cc-9663-1c60a5181ed4</guid>
      <link>https://share.transistor.fm/s/200b6213</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.MM, cs.SD, eess.AS, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04328v1">http://arxiv.org/abs/2502.04328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which progressively extends the modalities supported by the language model. Our training pipeline begins with the most distinct modalities, image and text, then gradually expands the skill set of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to keep the cross-modal alignment data relatively small, making it easier and less costly to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.MM, cs.SD, eess.AS, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04328v1">http://arxiv.org/abs/2502.04328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which progressively extends the modalities supported by the language model. Our training pipeline begins with the most distinct modalities, image and text, then gradually expands the skill set of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to keep the cross-modal alignment data relatively small, making it easier and less costly to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:51:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/200b6213/c7424ef7.mp3" length="19947213" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.MM, cs.SD, eess.AS, eess.IV</p>

            <p><strong>Authors:</strong><br>
            Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao</p>

            <p><strong>Title:</strong><br>
            Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04328v1">http://arxiv.org/abs/2502.04328v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, they still lag notably behind specialized single-modality models in performance. In this paper, we present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts. The core design of Ola lies in its progressive modality alignment strategy, which progressively extends the modalities supported by the language model. Our training pipeline begins with the most distinct modalities, image and text, then gradually expands the skill set of the model using speech data that connects language and audio knowledge, and video data that connects all modalities. The progressive learning pipeline also enables us to keep the cross-modal alignment data relatively small, making it easier and less costly to develop omni-modal models from existing vision-language models. Moreover, to unlock an advanced interactive experience like GPT-4o, we further design a sentence-wise decoding solution for streaming speech generation. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm</title>
      <itunes:episode>503</itunes:episode>
      <podcast:episode>503</podcast:episode>
      <itunes:title>MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1bd33923-96c4-4e55-8781-b3d624033bd4</guid>
      <link>https://share.transistor.fm/s/2fbfd9f2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh</p>

            <p><strong>Title:</strong><br>
            MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02358v3">http://arxiv.org/abs/2502.02358v3</a></p>

            <p><strong>Abstract:</strong><br>
            Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities and fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce 1) the MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee time synchronization between source motion and target motion; 3) Task-Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at https://diouo.github.io/motionlab.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh</p>

            <p><strong>Title:</strong><br>
            MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02358v3">http://arxiv.org/abs/2502.02358v3</a></p>

            <p><strong>Abstract:</strong><br>
            Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities and fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce 1) the MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee time synchronization between source motion and target motion; 3) Task-Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at https://diouo.github.io/motionlab.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:50:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2fbfd9f2/1d973bde.mp3" length="19966443" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh</p>

            <p><strong>Title:</strong><br>
            MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02358v3">http://arxiv.org/abs/2502.02358v3</a></p>

            <p><strong>Abstract:</strong><br>
            Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities and fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce 1) the MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding to guarantee time synchronization between source motion and target motion; 3) Task-Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion. Our code and additional video results are available at https://diouo.github.io/motionlab.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion</title>
      <itunes:episode>502</itunes:episode>
      <podcast:episode>502</podcast:episode>
      <itunes:title>MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76df5e3a-bbba-476d-b7e0-53419402bd5a</guid>
      <link>https://share.transistor.fm/s/9a09e0bc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xintong Hao, Ke Shen, Chenggang Li</p>

            <p><strong>Title:</strong><br>
            MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04235v1">http://arxiv.org/abs/2502.04235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from an existing corpus. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data-budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of next-generation large-scale synthetic pretraining for language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal the limitations of conventional collapse-detection metrics based on validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xintong Hao, Ke Shen, Chenggang Li</p>

            <p><strong>Title:</strong><br>
            MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04235v1">http://arxiv.org/abs/2502.04235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from an existing corpus. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data-budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of next-generation large-scale synthetic pretraining for language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal the limitations of conventional collapse-detection metrics based on validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:50:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9a09e0bc/234e5787.mp3" length="22175754" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xintong Hao, Ke Shen, Chenggang Li</p>

            <p><strong>Title:</strong><br>
            MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04235v1">http://arxiv.org/abs/2502.04235v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from an existing corpus. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data-budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity of next-generation large-scale synthetic pretraining for language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal the limitations of conventional collapse-detection metrics based on validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization</title>
      <itunes:episode>501</itunes:episode>
      <podcast:episode>501</podcast:episode>
      <itunes:title>ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d1fe53ff-27fb-46ba-ae8a-6017a764d510</guid>
      <link>https://share.transistor.fm/s/bd7ccac7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam</p>

            <p><strong>Title:</strong><br>
            ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04306v1">http://arxiv.org/abs/2502.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam</p>

            <p><strong>Title:</strong><br>
            ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04306v1">http://arxiv.org/abs/2502.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:50:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd7ccac7/14c3f6d2.mp3" length="19850236" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1237</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam</p>

            <p><strong>Title:</strong><br>
            ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04306v1">http://arxiv.org/abs/2502.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</title>
      <itunes:episode>500</itunes:episode>
      <podcast:episode>500</podcast:episode>
      <itunes:title>Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cf293c4e-8a8a-4f86-83c6-d7029b51d302</guid>
      <link>https://share.transistor.fm/s/3d906866</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | eess.AS, cs.AI, cs.CL, cs.MM, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue</p>

            <p><strong>Title:</strong><br>
            Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04128v1">http://arxiv.org/abs/2502.04128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework, Llasa, for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | eess.AS, cs.AI, cs.CL, cs.MM, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue</p>

            <p><strong>Title:</strong><br>
            Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04128v1">http://arxiv.org/abs/2502.04128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework, Llasa, for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 07 Feb 2025 20:49:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3d906866/fdbb1f27.mp3" length="21667527" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | eess.AS, cs.AI, cs.CL, cs.MM, cs.SD</p>

            <p><strong>Authors:</strong><br>
            Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue</p>

            <p><strong>Title:</strong><br>
            Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.04128v1">http://arxiv.org/abs/2502.04128v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework, Llasa, for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we have publicly released the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model</title>
      <itunes:episode>499</itunes:episode>
      <podcast:episode>499</podcast:episode>
      <itunes:title>SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">62f5c784-f4e3-491f-9e1a-863204a7b5a6</guid>
      <link>https://share.transistor.fm/s/d37e78bf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02737v1">http://arxiv.org/abs/2502.02737v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02737v1">http://arxiv.org/abs/2502.02737v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:50:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d37e78bf/57c853c9.mp3" length="20929404" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1304</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 90 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, Thomas Wolf</p>

            <p><strong>Title:</strong><br>
            SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02737v1">http://arxiv.org/abs/2502.02737v1</a></p>

            <p><strong>Abstract:</strong><br>
            While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets</title>
      <itunes:episode>498</itunes:episode>
      <podcast:episode>498</podcast:episode>
      <itunes:title>TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03b1dd30-8a7c-4df9-b3b5-0e6d00966927</guid>
      <link>https://share.transistor.fm/s/408aef71</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CE, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01506v2">http://arxiv.org/abs/2502.01506v2</a></p>

            <p><strong>Abstract:</strong><br>
            The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CE, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01506v2">http://arxiv.org/abs/2502.01506v2</a></p>

            <p><strong>Abstract:</strong><br>
            The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:49:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/408aef71/319ae88f.mp3" length="22390171" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CE, cs.CY</p>

            <p><strong>Authors:</strong><br>
            Yuzhe Yang, Yifei Zhang, Minghao Wu, Kaidi Zhang, Yunmiao Zhang, Honghai Yu, Yan Hu, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            TwinMarket: A Scalable Behavioral and Social Simulation for Financial Markets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01506v2">http://arxiv.org/abs/2502.01506v2</a></p>

            <p><strong>Abstract:</strong><br>
            The study of social emergence has long been a central focus in social science. Traditional modeling approaches, such as rule-based Agent-Based Models (ABMs), struggle to capture the diversity and complexity of human behavior, particularly the irrational factors emphasized in behavioral economics. Recently, large language model (LLM) agents have gained traction as simulation tools for modeling human behavior in social science and role-playing applications. Studies suggest that LLMs can account for cognitive biases, emotional fluctuations, and other non-rational influences, enabling more realistic simulations of socio-economic dynamics. In this work, we introduce TwinMarket, a novel multi-agent framework that leverages LLMs to simulate socio-economic systems. Specifically, we examine how individual behaviors, through interactions and feedback mechanisms, give rise to collective dynamics and emergent phenomena. Through experiments in a simulated stock market environment, we demonstrate how individual actions can trigger group behaviors, leading to emergent outcomes such as financial bubbles and recessions. Our approach provides valuable insights into the complex interplay between individual decision-making and collective socio-economic patterns.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Demystifying Long Chain-of-Thought Reasoning in LLMs</title>
      <itunes:episode>497</itunes:episode>
      <podcast:episode>497</podcast:episode>
      <itunes:title>Demystifying Long Chain-of-Thought Reasoning in LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">343edcf6-f875-4680-a088-840934af97c3</guid>
      <link>https://share.transistor.fm/s/3ecf74f2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Demystifying Long Chain-of-Thought Reasoning in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03373v1">http://arxiv.org/abs/2502.03373v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Demystifying Long Chain-of-Thought Reasoning in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03373v1">http://arxiv.org/abs/2502.03373v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:49:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3ecf74f2/6609a075.mp3" length="20164095" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            Demystifying Long Chain-of-Thought Reasoning in LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03373v1">http://arxiv.org/abs/2502.03373v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LIMO: Less is More for Reasoning</title>
      <itunes:episode>496</itunes:episode>
      <podcast:episode>496</podcast:episode>
      <itunes:title>LIMO: Less is More for Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">39c08ec7-010b-441e-a265-d0a09f1b1f1d</guid>
      <link>https://share.transistor.fm/s/a72316ac</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMO: Less is More for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03387v1">http://arxiv.org/abs/2502.03387v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (&gt;100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMO: Less is More for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03387v1">http://arxiv.org/abs/2502.03387v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (&gt;100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:48:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a72316ac/7b9d54f7.mp3" length="22729508" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            LIMO: Less is More for Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.03387v1">http://arxiv.org/abs/2502.03387v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (&gt;100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking</title>
      <itunes:episode>495</itunes:episode>
      <podcast:episode>495</podcast:episode>
      <itunes:title>Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2336a0ee-bfce-4732-8641-d6b2c190fecc</guid>
      <link>https://share.transistor.fm/s/fb095a0b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02339v1">http://arxiv.org/abs/2502.02339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning. While recent efforts attempt to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking through explicit search structures or teacher-guided distillation, they often struggle to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and search spaces, resulting in low-efficiency implicit insight extraction and data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates models' internal reasoning capabilities and external reasoning guidelines, enabling efficient inference with minimal tree iterations. This novel paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness, achieving superior accuracy (54.0%) on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining substantial data and computational efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02339v1">http://arxiv.org/abs/2502.02339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning. While recent efforts attempt to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking through explicit search structures or teacher-guided distillation, they often struggle to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and search spaces, resulting in low-efficiency implicit insight extraction and data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates models' internal reasoning capabilities and external reasoning guidelines, enabling efficient inference with minimal tree iterations. This novel paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness, achieving superior accuracy (54.0%) on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining substantial data and computational efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:48:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb095a0b/fa508516.mp3" length="20927723" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1304</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jinyang Wu, Mingkuan Feng, Shuai Zhang, Ruihan Jin, Feihu Che, Zengqi Wen, Jianhua Tao</p>

            <p><strong>Title:</strong><br>
            Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02339v1">http://arxiv.org/abs/2502.02339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) exhibit impressive capabilities but still face challenges in complex visual reasoning. While recent efforts attempt to enhance MLLMs' reasoning by incorporating OpenAI o1-like structured thinking through explicit search structures or teacher-guided distillation, they often struggle to balance performance and efficiency. A critical limitation is their heavy reliance on extensive data and search spaces, resulting in low-efficiency implicit insight extraction and data utilization. To address this, we propose AStar, an Automated Structured thinking paradigm for multimodal reasoning via Monte Carlo Tree Search (MCTS). AStar automatically derives high-level cognitive reasoning patterns from limited data using MCTS-powered hierarchical structures. Building on these explicit patterns, we design a unified reasoning framework that seamlessly integrates models' internal reasoning capabilities and external reasoning guidelines, enabling efficient inference with minimal tree iterations. This novel paradigm strikes a compelling balance between performance and efficiency. Extensive experiments demonstrate AStar's effectiveness, achieving superior accuracy (54.0%) on the MathVerse benchmark with a 7B backbone, surpassing GPT-4o (50.2%) while maintaining substantial data and computational efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer</title>
      <itunes:episode>494</itunes:episode>
      <podcast:episode>494</podcast:episode>
      <itunes:title>LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">314302ba-799c-4755-b3b4-cfe998e5e7cb</guid>
      <link>https://share.transistor.fm/s/60c16526</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Danze Chen, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01105v1">http://arxiv.org/abs/2502.01105v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating cognitive-aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers' layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer's superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Danze Chen, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01105v1">http://arxiv.org/abs/2502.01105v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating cognitive-aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers' layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer's superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:48:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/60c16526/754d3958.mp3" length="24586547" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1533</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiren Song, Danze Chen, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01105v1">http://arxiv.org/abs/2502.01105v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generating cognitive-aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers' layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer's superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>On Teacher Hacking in Language Model Distillation</title>
      <itunes:episode>493</itunes:episode>
      <podcast:episode>493</podcast:episode>
      <itunes:title>On Teacher Hacking in Language Model Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3e727d37-bb1a-42a3-9f06-85c8a6b22fb5</guid>
      <link>https://share.transistor.fm/s/2bd07da2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel</p>

            <p><strong>Title:</strong><br>
            On Teacher Hacking in Language Model Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02671v1">http://arxiv.org/abs/2502.02671v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such a phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel</p>

            <p><strong>Title:</strong><br>
            On Teacher Hacking in Language Model Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02671v1">http://arxiv.org/abs/2502.02671v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such a phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:47:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2bd07da2/2fed870c.mp3" length="20230965" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel</p>

            <p><strong>Title:</strong><br>
            On Teacher Hacking in Language Model Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02671v1">http://arxiv.org/abs/2502.02671v1</a></p>

            <p><strong>Abstract:</strong><br>
            Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such a phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods</title>
      <itunes:episode>492</itunes:episode>
      <podcast:episode>492</podcast:episode>
      <itunes:title>A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c834f781-2229-46bb-8459-91ea0b55d671</guid>
      <link>https://share.transistor.fm/s/ac96801f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava</p>

            <p><strong>Title:</strong><br>
            A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01618v2">http://arxiv.org/abs/2502.01618v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava</p>

            <p><strong>Title:</strong><br>
            A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01618v2">http://arxiv.org/abs/2502.01618v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.</p>
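
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract's framing of inference-time scaling as probabilistic inference can be pictured as a simple particle filter over partial generations: propagate each particle one step, weight it by a reward model treated as an approximate likelihood, and resample. <code>extend_fn</code> and <code>reward_fn</code> below are toy placeholders, not the authors' implementation.</p>

            <pre><code>import math
import random

def particle_filter(prompt, extend_fn, reward_fn, num_particles=8, num_steps=4):
    """Maintain N partial generations, weight each step by the reward model,
    and resample in proportion to the weights."""
    particles = [prompt] * num_particles
    for _ in range(num_steps):
        # Propagate: extend every particle by one reasoning step.
        particles = [extend_fn(p) for p in particles]
        # Weight: softmax of per-particle rewards (approximate likelihood).
        rewards = [reward_fn(p) for p in particles]
        m = max(rewards)
        weights = [math.exp(r - m) for r in rewards]
        # Resample particles in proportion to their weights.
        particles = random.choices(particles, weights=weights, k=num_particles)
    return max(particles, key=reward_fn)

# Toy stand-ins so the sketch runs end to end.
extend_fn = lambda p: p + random.choice([" a", " b"])
reward_fn = lambda p: p.count(" a")  # pretend "a"-steps are rewarded
print(particle_filter("solve:", extend_fn, reward_fn))
</code></pre>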
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:47:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ac96801f/b5f9570b.mp3" length="23173040" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1445</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, Akash Srivastava</p>

            <p><strong>Title:</strong><br>
            A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01618v2">http://arxiv.org/abs/2502.01618v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x better scaling rate over our deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, we show that Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts. Our work not only presents an effective method for inference-time scaling, but also connects the rich literature in probabilistic inference with inference-time scaling of LLMs to develop more robust algorithms in future work. Code and further information are available at https://probabilistic-inference-scaling.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Jailbreaking with Universal Multi-Prompts</title>
      <itunes:episode>491</itunes:episode>
      <podcast:episode>491</podcast:episode>
      <itunes:title>Jailbreaking with Universal Multi-Prompts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6b86a368-b4eb-4be5-b3d6-dcab445b2a06</guid>
      <link>https://share.transistor.fm/s/a92d1415</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen</p>

            <p><strong>Title:</strong><br>
            Jailbreaking with Universal Multi-Prompts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01154v1">http://arxiv.org/abs/2502.01154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which incurs higher computational costs when dealing with large datasets; less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen</p>

            <p><strong>Title:</strong><br>
            Jailbreaking with Universal Multi-Prompts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01154v1">http://arxiv.org/abs/2502.01154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which incurs higher computational costs when dealing with large datasets; less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 06 Feb 2025 20:47:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a92d1415/6171c304.mp3" length="19385426" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1208</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yu-Ling Hsu, Hsuan Su, Shang-Tse Chen</p>

            <p><strong>Title:</strong><br>
            Jailbreaking with Universal Multi-Prompts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01154v1">http://arxiv.org/abs/2502.01154v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have seen rapid development in recent years, revolutionizing various applications and significantly enhancing convenience and productivity. However, alongside their impressive capabilities, ethical concerns and new types of attacks, such as jailbreaking, have emerged. Most prompting techniques focus on optimizing adversarial inputs for individual cases, which incurs higher computational costs when dealing with large datasets; less research has addressed the more general setting of training a universal attacker that can transfer to unseen tasks. In this paper, we introduce JUMP, a prompt-based method designed to jailbreak LLMs using universal multi-prompts. We also adapt our approach for defense, which we term DUMP. Experimental results demonstrate that our method for optimizing universal multi-prompts outperforms existing techniques.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models</title>
      <itunes:episode>490</itunes:episode>
      <podcast:episode>490</podcast:episode>
      <itunes:title>VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be663dc8-f22b-4ad9-85d9-6a7de2f3f5fa</guid>
      <link>https://share.transistor.fm/s/40e20532</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin</p>

            <p><strong>Title:</strong><br>
            VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02492v1">http://arxiv.org/abs/2502.02492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin</p>

            <p><strong>Title:</strong><br>
            VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02492v1">http://arxiv.org/abs/2502.02492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/</p>
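
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A toy version of a joint appearance-motion objective, where one shared representation feeds a pixel head and a motion head and the training loss sums both terms. The module names, tensor shapes, and the weight <code>lam</code> are assumptions for illustration only, not the paper's model.</p>

            <pre><code>import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Shared backbone with two heads: one for pixels, one for a motion target."""
    def __init__(self, dim=64, pixel_dim=48, motion_dim=32):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)      # stand-in for the video generator
        self.pixel_head = nn.Linear(dim, pixel_dim)
        self.motion_head = nn.Linear(dim, motion_dim)

    def forward(self, z):
        h = torch.relu(self.backbone(z))
        return self.pixel_head(h), self.motion_head(h)

model = JointHead()
z = torch.randn(4, 64)                 # toy latent video features
pixel_target = torch.randn(4, 48)      # appearance target
motion_target = torch.randn(4, 32)     # motion target (e.g., flow)
pred_pixels, pred_motion = model(z)
lam = 0.5                              # assumed weight on the motion term
loss = nn.functional.mse_loss(pred_pixels, pixel_target) \
     + lam * nn.functional.mse_loss(pred_motion, motion_target)
loss.backward()
</code></pre>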
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:29:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40e20532/a2451ffb.mp3" length="18723434" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1167</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin</p>

            <p><strong>Title:</strong><br>
            VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02492v1">http://arxiv.org/abs/2502.02492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Inverse Bridge Matching Distillation</title>
      <itunes:episode>489</itunes:episode>
      <podcast:episode>489</podcast:episode>
      <itunes:title>Inverse Bridge Matching Distillation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">53511749-b8dc-4a04-8e7a-2feec3d5e534</guid>
      <link>https://share.transistor.fm/s/afdf987d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin</p>

            <p><strong>Title:</strong><br>
            Inverse Bridge Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01362v1">http://arxiv.org/abs/2502.01362v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and, depending on the particular setup, even provide better generation quality than the teacher model used.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin</p>

            <p><strong>Title:</strong><br>
            Inverse Bridge Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01362v1">http://arxiv.org/abs/2502.01362v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and, depending on the particular setup, even provide better generation quality than the teacher model used.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:29:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/afdf987d/183d5faa.mp3" length="19069862" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1188</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin</p>

            <p><strong>Title:</strong><br>
            Inverse Bridge Matching Distillation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01362v1">http://arxiv.org/abs/2502.01362v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and, depending on the particular setup, even provide better generation quality than the teacher model used.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ACECODER: Acing Coder RL via Automated Test-Case Synthesis</title>
      <itunes:episode>488</itunes:episode>
      <podcast:episode>488</podcast:episode>
      <itunes:title>ACECODER: Acing Coder RL via Automated Test-Case Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6272314-9848-4474-834c-5a109b76aa85</guid>
      <link>https://share.transistor.fm/s/5743d5d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            ACECODER: Acing Coder RL via Automated Test-Case Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01718v1">http://arxiv.org/abs/2502.01718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. The resulting reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting from Qwen2.5-Coder-base directly, and show that our RL training can improve the model on HumanEval-plus by over 25% and MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            ACECODER: Acing Coder RL via Automated Test-Case Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01718v1">http://arxiv.org/abs/2502.01718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. The resulting reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting from Qwen2.5-Coder-base directly, and show that our RL training can improve the model on HumanEval-plus by over 25% and MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.</p>
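
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The Bradley-Terry preference loss mentioned in the abstract trains the reward model to score the program with the higher test-case pass rate ("chosen") above the one with the lower pass rate ("rejected"). The scores below are toy tensors standing in for a real reward model's outputs.</p>

            <pre><code>import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores, rejected_scores):
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy reward-model scores for 3 (chosen, rejected) pairs, where "chosen"
# means the sampled program passed more synthesized test cases.
chosen = torch.tensor([1.3, 0.2, 2.1], requires_grad=True)
rejected = torch.tensor([0.4, -0.5, 1.9], requires_grad=True)
loss = bradley_terry_loss(chosen, rejected)
loss.backward()
print(float(loss))
</code></pre>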
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:29:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5743d5d8/96994b8e.mp3" length="19357440" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1206</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.SE, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            ACECODER: Acing Coder RL via Automated Test-Case Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01718v1">http://arxiv.org/abs/2502.01718v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. The resulting reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting from Qwen2.5-Coder-base directly, and show that our RL training can improve the model on HumanEval-plus by over 25% and MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search</title>
      <itunes:episode>487</itunes:episode>
      <podcast:episode>487</podcast:episode>
      <itunes:title>QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c32a64c3-8db2-418e-a336-446b6b1e05d7</guid>
      <link>https://share.transistor.fm/s/3bf4b700</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang</p>

            <p><strong>Title:</strong><br>
            QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02584v1">http://arxiv.org/abs/2502.02584v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang</p>

            <p><strong>Title:</strong><br>
            QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02584v1">http://arxiv.org/abs/2502.02584v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.</p>
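
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal picture of Q-guided stepwise generation: at each step the agent samples several candidate actions and keeps the one with the highest estimated Q-value, rather than scoring only the finished trajectory. <code>propose_fn</code> and <code>q_value_fn</code> are toy placeholders for the policy LLM and the learned process reward / Q model.</p>

            <pre><code>import random

def q_guided_rollout(state, propose_fn, q_value_fn, num_candidates=4, max_steps=5):
    """Greedy stepwise search: pick the candidate next step with the best Q-value."""
    trajectory = [state]
    for _ in range(max_steps):
        candidates = [propose_fn(state) for _ in range(num_candidates)]
        state = max(candidates, key=lambda nxt: q_value_fn(state, nxt))
        trajectory.append(state)
    return trajectory

# Toy stand-ins so the sketch runs: states are integers, proposals append a
# digit, and the "Q model" simply prefers larger states.
propose_fn = lambda s: s * 10 + random.randint(0, 9)
q_value_fn = lambda s, nxt: nxt
print(q_guided_rollout(0, propose_fn, q_value_fn))
</code></pre>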
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:28:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3bf4b700/90eefdb6.mp3" length="17538492" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1092</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang</p>

            <p><strong>Title:</strong><br>
            QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02584v1">http://arxiv.org/abs/2502.02584v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search</title>
      <itunes:episode>486</itunes:episode>
      <podcast:episode>486</podcast:episode>
      <itunes:title>Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81021b1e-005c-4580-b3c6-347ec8ffa67d</guid>
      <link>https://share.transistor.fm/s/95402700</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02508v1">http://arxiv.org/abs/2502.02508v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02508v1">http://arxiv.org/abs/2502.02508v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:28:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95402700/ba23fde7.mp3" length="23230299" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1448</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan</p>

            <p><strong>Title:</strong><br>
            Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02508v1">http://arxiv.org/abs/2502.02508v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?</title>
      <itunes:episode>485</itunes:episode>
      <podcast:episode>485</podcast:episode>
      <itunes:title>Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f2ddb678-df30-4ea0-abe8-743b6838001a</guid>
      <link>https://share.transistor.fm/s/d3106a7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin</p>

            <p><strong>Title:</strong><br>
            Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00674v1">http://arxiv.org/abs/2502.00674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin</p>

            <p><strong>Title:</strong><br>
            Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00674v1">http://arxiv.org/abs/2502.00674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.</p>
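
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            Self-MoA as described in the abstract can be pictured as sampling several answers from the single top-performing model and then asking that same model to aggregate them into one final answer. The <code>llm</code> callable and the aggregation prompt wording below are placeholders, not the authors' implementation.</p>

            <pre><code>import random

def self_moa(question, llm, num_samples=4):
    """Sample several candidate answers from one model, then aggregate them."""
    candidates = [llm(question) for _ in range(num_samples)]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    aggregation_prompt = (
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n"
        "Synthesize the best single answer from the candidates above."
    )
    return llm(aggregation_prompt)

# Toy stand-in so the sketch runs without an API.
toy_llm = lambda prompt: random.choice(["answer A", "answer B", "answer C"])
print(self_moa("What is 2+2?", toy_llm))
</code></pre>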
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:28:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3106a7a/8fca4286.mp3" length="22684420" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin</p>

            <p><strong>Title:</strong><br>
            Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00674v1">http://arxiv.org/abs/2502.00674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation</title>
      <itunes:episode>484</itunes:episode>
      <podcast:episode>484</podcast:episode>
      <itunes:title>COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">043a8f90-a5c0-4752-887a-22c4d7904589</guid>
      <link>https://share.transistor.fm/s/cee29134</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02589v1">http://arxiv.org/abs/2502.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02589v1">http://arxiv.org/abs/2502.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 05 Feb 2025 20:27:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cee29134/bcfeb483.mp3" length="24049920" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1499</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.02589v1">http://arxiv.org/abs/2502.02589v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Differences Between Direct Alignment Algorithms are a Blur</title>
      <itunes:episode>483</itunes:episode>
      <podcast:episode>483</podcast:episode>
      <itunes:title>The Differences Between Direct Alignment Algorithms are a Blur</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">82d13685-4b29-4cbc-bb6d-68b6d194dd90</guid>
      <link>https://share.transistor.fm/s/4a68a4a2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            The Differences Between Direct Alignment Algorithms are a Blur</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01237v1">http://arxiv.org/abs/2502.01237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the $\beta$ parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +$3.46$ (ORPO) and +$8.27$ (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            The Differences Between Direct Alignment Algorithms are a Blur</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01237v1">http://arxiv.org/abs/2502.01237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the $\beta$ parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +$3.46$ (ORPO) and +$8.27$ (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 21:00:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4a68a4a2/a1f66689.mp3" length="19427243" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1211</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 84 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov</p>

            <p><strong>Title:</strong><br>
            The Differences Between Direct Alignment Algorithms are a Blur</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01237v1">http://arxiv.org/abs/2502.01237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Direct Alignment Algorithms (DAAs) simplify language model alignment by replacing reinforcement learning (RL) and reward modeling (RM) in Reinforcement Learning from Human Feedback (RLHF) with direct policy optimization. DAAs can be classified by their ranking losses (pairwise vs. pointwise), by the rewards used in those losses (e.g., likelihood ratios of policy and reference policy, or odds ratios), or by whether a Supervised Fine-Tuning (SFT) phase is required (two-stage vs. one-stage). We first show that one-stage methods underperform two-stage methods. To address this, we incorporate an explicit SFT phase and introduce the $\beta$ parameter, controlling the strength of preference optimization, into single-stage ORPO and ASFT. These modifications improve their performance in Alpaca Eval 2 by +$3.46$ (ORPO) and +$8.27$ (ASFT), matching two-stage methods like DPO. Further analysis reveals that the key factor is whether the approach uses pairwise or pointwise objectives, rather than the specific implicit reward or loss function. These results highlight the importance of careful evaluation to avoid premature claims of performance gains or overall superiority in alignment algorithms.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models</title>
      <itunes:episode>482</itunes:episode>
      <podcast:episode>482</podcast:episode>
      <itunes:title>OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8277ed79-496d-46bf-abdc-3851ebd7ff01</guid>
      <link>https://share.transistor.fm/s/81cb4657</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01061v1">http://arxiv.org/abs/2502.01061v1</a></p>

            <p><strong>Abstract:</strong><br>
            End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01061v1">http://arxiv.org/abs/2502.01061v1</a></p>

            <p><strong>Abstract:</strong><br>
            End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 21:00:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81cb4657/a75c9f5d.mp3" length="25395306" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1584</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang</p>

            <p><strong>Title:</strong><br>
            OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01061v1">http://arxiv.org/abs/2502.01061v1</a></p>

            <p><strong>Abstract:</strong><br>
            End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up the way large general video generation models do, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Process Reinforcement through Implicit Rewards</title>
      <itunes:episode>481</itunes:episode>
      <podcast:episode>481</podcast:episode>
      <itunes:title>Process Reinforcement through Implicit Rewards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76ec6df7-c08c-4a4f-b3ee-27bea796923b</guid>
      <link>https://share.transistor.fm/s/7f1cf277</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Process Reinforcement through Implicit Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01456v1">http://arxiv.org/abs/2502.01456v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Process Reinforcement through Implicit Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01456v1">http://arxiv.org/abs/2502.01456v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 21:00:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f1cf277/dffcb04c.mp3" length="21034280" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1311</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding</p>

            <p><strong>Title:</strong><br>
            Process Reinforcement through Implicit Rewards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01456v1">http://arxiv.org/abs/2502.01456v1</a></p>

            <p><strong>Abstract:</strong><br>
            Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward-model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition-level math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding</title>
      <itunes:episode>480</itunes:episode>
      <podcast:episode>480</podcast:episode>
      <itunes:title>AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7887aed1-1e0f-4924-a882-465603dcc8ea</guid>
      <link>https://share.transistor.fm/s/d6d48e9b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01341v1">http://arxiv.org/abs/2502.01341v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01341v1">http://arxiv.org/abs/2502.01341v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:59:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6d48e9b/0898db31.mp3" length="22574495" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar</p>

            <p><strong>Title:</strong><br>
            AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01341v1">http://arxiv.org/abs/2502.01341v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model</title>
      <itunes:episode>479</itunes:episode>
      <podcast:episode>479</podcast:episode>
      <itunes:title>SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f9682a7-d73b-4b17-9192-a41fda0d2578</guid>
      <link>https://share.transistor.fm/s/884d27d1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang</p>

            <p><strong>Title:</strong><br>
            SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18636v1">http://arxiv.org/abs/2501.18636v1</a></p>

            <p><strong>Abstract:</strong><br>
            The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attacks by manipulating that knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily by hand, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and even the most obvious attacks can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang</p>

            <p><strong>Title:</strong><br>
            SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18636v1">http://arxiv.org/abs/2501.18636v1</a></p>

            <p><strong>Abstract:</strong><br>
            The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attacks by manipulating that knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily by hand, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and even the most obvious attacks can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:59:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/884d27d1/de1547e6.mp3" length="22675230" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang</p>

            <p><strong>Title:</strong><br>
            SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18636v1">http://arxiv.org/abs/2501.18636v1</a></p>

            <p><strong>Abstract:</strong><br>
            The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attacks by manipulating that knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily by hand, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and even the most obvious attacks can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Preference Leakage: A Contamination Problem in LLM-as-a-judge</title>
      <itunes:episode>478</itunes:episode>
      <podcast:episode>478</podcast:episode>
      <itunes:title>Preference Leakage: A Contamination Problem in LLM-as-a-judge</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8236afd4-3487-42f4-bfd9-bcfd8e816f6c</guid>
      <link>https://share.transistor.fm/s/ecf0ab84</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Preference Leakage: A Contamination Problem in LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01534v1">http://arxiv.org/abs/2502.01534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common forms of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Preference Leakage: A Contamination Problem in LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01534v1">http://arxiv.org/abs/2502.01534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common forms of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:58:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ecf0ab84/54ca950a.mp3" length="21117887" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu</p>

            <p><strong>Title:</strong><br>
            Preference Leakage: A Contamination Problem in LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01534v1">http://arxiv.org/abs/2502.01534v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common forms of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive issue that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SliderSpace: Decomposing the Visual Capabilities of Diffusion Models</title>
      <itunes:episode>477</itunes:episode>
      <podcast:episode>477</podcast:episode>
      <itunes:title>SliderSpace: Decomposing the Visual Capabilities of Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">19f31cda-f15b-4bc2-90b1-02e07e76ed24</guid>
      <link>https://share.transistor.fm/s/c42d1235</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin</p>

            <p><strong>Title:</strong><br>
            SliderSpace: Decomposing the Visual Capabilities of Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01639v1">http://arxiv.org/abs/2502.01639v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin</p>

            <p><strong>Title:</strong><br>
            SliderSpace: Decomposing the Visual Capabilities of Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01639v1">http://arxiv.org/abs/2502.01639v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:58:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c42d1235/92306049.mp3" length="24086240" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1502</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin</p>

            <p><strong>Title:</strong><br>
            SliderSpace: Decomposing the Visual Capabilities of Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.01639v1">http://arxiv.org/abs/2502.01639v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of the model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models</title>
      <itunes:episode>476</itunes:episode>
      <podcast:episode>476</podcast:episode>
      <itunes:title>MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e2dfedd7-08e4-4812-a236-722f67f79acc</guid>
      <link>https://share.transistor.fm/s/f4dfa9ed</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huanqia Cai, Yijun Yang, Winston Hu</p>

            <p><strong>Title:</strong><br>
            MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00698v1">http://arxiv.org/abs/2502.00698v1</a></p>

            <p><strong>Abstract:</strong><br>
            IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huanqia Cai, Yijun Yang, Winston Hu</p>

            <p><strong>Title:</strong><br>
            MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00698v1">http://arxiv.org/abs/2502.00698v1</a></p>

            <p><strong>Abstract:</strong><br>
            IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:58:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f4dfa9ed/63f8178e.mp3" length="23715519" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1479</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Huanqia Cai, Yijun Yang, Winston Hu</p>

            <p><strong>Title:</strong><br>
            MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00698v1">http://arxiv.org/abs/2502.00698v1</a></p>

            <p><strong>Abstract:</strong><br>
            IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of leading open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (27.49% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal systems in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AIN: The Arabic INclusive Large Multimodal Model</title>
      <itunes:episode>475</itunes:episode>
      <podcast:episode>475</podcast:episode>
      <itunes:title>AIN: The Arabic INclusive Large Multimodal Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">065eb7c1-0d12-44c2-ae62-153cf473fc4c</guid>
      <link>https://share.transistor.fm/s/385f1dec</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan</p>

            <p><strong>Title:</strong><br>
            AIN: The Arabic INclusive Large Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00094v1">http://arxiv.org/abs/2502.00094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan</p>

            <p><strong>Title:</strong><br>
            AIN: The Arabic INclusive Large Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00094v1">http://arxiv.org/abs/2502.00094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging a carefully constructed set of 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 04 Feb 2025 20:57:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/385f1dec/9b2adf99.mp3" length="19769119" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1232</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan</p>

            <p><strong>Title:</strong><br>
            AIN: The Arabic INclusive Large Multimodal Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2502.00094v1">http://arxiv.org/abs/2502.00094v1</a></p>

            <p><strong>Abstract:</strong><br>
            Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging a carefully constructed set of 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>s1: Simple test-time scaling</title>
      <itunes:episode>474</itunes:episode>
      <podcast:episode>474</podcast:episode>
      <itunes:title>s1: Simple test-time scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">42cc8472-4766-4407-943e-29ae72398112</guid>
      <link>https://share.transistor.fm/s/e6b160dd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto</p>

            <p><strong>Title:</strong><br>
            s1: Simple test-time scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19393v1">http://arxiv.org/abs/2501.19393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto</p>

            <p><strong>Title:</strong><br>
            s1: Simple test-time scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19393v1">http://arxiv.org/abs/2501.19393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.</p>
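
            <p><strong>Illustrative sketch:</strong><br>
            A minimal Python sketch of the budget-forcing idea described in the abstract: cap the length of the model's reasoning trace, and when the model tries to stop early, suppress the stop and append "Wait" to lengthen its thinking. The generate callable, whitespace-based token counting, and the default budgets are illustrative stand-ins, not the paper's implementation (see https://github.com/simplescaling/s1 for the released code).</p>

<pre><code>def budget_force(generate, prompt, max_thinking_tokens=2048, max_waits=2):
    """Control test-time compute by truncating or lengthening the reasoning trace."""
    trace, waits = "", 0
    while True:
        chunk, stopped = generate(prompt + trace)   # continue the current trace
        trace += chunk
        if len(trace.split()) >= max_thinking_tokens:
            return trace                            # budget exhausted: forcefully terminate
        if not stopped:
            continue                                # model is still thinking
        if waits >= max_waits:
            return trace                            # let the model finish
        trace += "\nWait"                           # suppress the stop; nudge a double-check
        waits += 1
</code></pre>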
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:59:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e6b160dd/f320bf0b.mp3" length="21750226" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto</p>

            <p><strong>Title:</strong><br>
            s1: Simple test-time scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19393v1">http://arxiv.org/abs/2501.19393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reward-Guided Speculative Decoding for Efficient LLM Reasoning</title>
      <itunes:episode>473</itunes:episode>
      <podcast:episode>473</podcast:episode>
      <itunes:title>Reward-Guided Speculative Decoding for Efficient LLM Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">57c10251-f07c-4275-946e-0e00d00db76e</guid>
      <link>https://share.transistor.fm/s/d8026b06</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Reward-Guided Speculative Decoding for Efficient LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19324v1">http://arxiv.org/abs/2501.19324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains compared to decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Reward-Guided Speculative Decoding for Efficient LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19324v1">http://arxiv.org/abs/2501.19324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains compared to decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.</p>
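
            <p><strong>Illustrative sketch:</strong><br>
            A step-level Python sketch of the threshold-based mixture the abstract describes: a cheap draft model proposes each reasoning step, a process reward model scores it, and the expensive target model is invoked only when the score falls below a threshold. The draft_step, target_step, and reward callables and the threshold value are illustrative assumptions, not the paper's API.</p>

<pre><code>def rsd_generate(draft_step, target_step, reward, prompt, n_steps=8, threshold=0.7):
    """Keep a cheap draft step when the process reward model scores it highly,
    otherwise fall back to the expensive target model for that step."""
    steps = []
    for _ in range(n_steps):
        context = prompt + "\n".join(steps)
        candidate = draft_step(context)           # cheap proposal from the draft model
        if reward(context, candidate) >= threshold:
            steps.append(candidate)               # accept the draft step
        else:
            steps.append(target_step(context))    # invoke the large target model instead
    return "\n".join(steps)
</code></pre>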
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:58:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d8026b06/3d6702ab.mp3" length="21005875" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1309</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Reward-Guided Speculative Decoding for Efficient LLM Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19324v1">http://arxiv.org/abs/2501.19324v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains compared to decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models</title>
      <itunes:episode>472</itunes:episode>
      <podcast:episode>472</podcast:episode>
      <itunes:title>Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">69ecb1a2-a23e-4332-a170-15011b7e7554</guid>
      <link>https://share.transistor.fm/s/b824b53d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng</p>

            <p><strong>Title:</strong><br>
            Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18119v1">http://arxiv.org/abs/2501.18119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the natural gap between Knowledge Graph (KG) structures and natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experimental results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng</p>

            <p><strong>Title:</strong><br>
            Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18119v1">http://arxiv.org/abs/2501.18119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the natural gap between Knowledge Graph (KG) structures and natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experimental results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.</p>
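
            <p><strong>Illustrative sketch:</strong><br>
            An illustrative Python sketch of the output format the abstract describes, i.e., representing a KG entity as 16 discrete code tokens that an LLM can read. Here the codes come from a toy nearest-neighbour, product-quantization-style lookup over a random codebook; the actual SSQR method learns the codes self-supervised from KG structure and semantics, which is not shown here.</p>

<pre><code>import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))      # toy codebook: 256 code vectors of dim 4
entity = rng.normal(size=(16, 4))         # toy entity embedding, split into 16 sub-vectors

def quantize_entity(sub_vectors, codebook):
    """Map each sub-vector to its nearest codebook entry and render the indices
    as 16 discrete tokens an LLM can consume alongside ordinary text."""
    dists = np.linalg.norm(sub_vectors[:, None, :] - codebook[None, :, :], axis=-1)
    codes = dists.argmin(axis=1)
    return " ".join(f"[KG_{int(c)}]" for c in codes)

print(quantize_entity(entity, codebook))  # e.g. "[KG_12] [KG_201] ..." (16 tokens)
</code></pre>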
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:58:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b824b53d/5056e771.mp3" length="20650240" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng</p>

            <p><strong>Title:</strong><br>
            Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18119v1">http://arxiv.org/abs/2501.18119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Due to the natural gap between Knowledge Graph (KG) structures and natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experimental results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PixelWorld: Towards Perceiving Everything as Pixels</title>
      <itunes:episode>471</itunes:episode>
      <podcast:episode>471</podcast:episode>
      <itunes:title>PixelWorld: Towards Perceiving Everything as Pixels</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">96c6ec06-1720-4573-9006-013f2f6920a6</guid>
      <link>https://share.transistor.fm/s/05eeca71</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Lyu, Xueguang Ma, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            PixelWorld: Towards Perceiving Everything as Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19339v1">http://arxiv.org/abs/2501.19339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e. "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge the existing models' performance. Our findings show that (1) PEAP outperforms token-based baselines on multimodal datasets, benefiting from unified input for better disambiguation, (2) all models show significant declines in reasoning and coding capabilities when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities, (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation, (4) the attention pattern of PEAP is highly aligned with text token input, and (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent in pixel perception; however, there is still headroom for improvement. Our code and dataset will be released upon acceptance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Lyu, Xueguang Ma, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            PixelWorld: Towards Perceiving Everything as Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19339v1">http://arxiv.org/abs/2501.19339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e. "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge the existing models' performance. Our findings show that (1) PEAP outperforms token-based baselines on multimodal datasets, benefiting from unified input for better disambiguation, (2) all models show significant declines in reasoning and coding capabilities when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities, (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation, (4) the attention pattern of PEAP is highly aligned with text token input, and (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent in pixel perception; however, there is still headroom for improvement. Our code and dataset will be released upon acceptance.</p>
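
            <p><strong>Illustrative sketch:</strong><br>
            A small Python sketch of the "Perceive Everything as Pixels" input path: instead of tokenizing a prompt, render it onto an image and hand the pixels to a vision-language model. This uses Pillow's default font; the layout details and the downstream VLM call are illustrative, not the paper's pipeline.</p>

<pre><code>from PIL import Image, ImageDraw

def text_to_pixels(text, width=896, line_height=18, margin=16):
    """Render a text prompt onto a white canvas so it can be fed to a
    vision-language model as pixels rather than as tokens."""
    lines = text.splitlines() or [""]
    height = margin * 2 + line_height * len(lines)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black")
    return img

img = text_to_pixels("Q: What is 17 * 3?\nA:")
img.save("prompt_as_pixels.png")          # hand this image to the VLM instead of token ids
</code></pre>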
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:58:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/05eeca71/87350e86.mp3" length="19363702" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Lyu, Xueguang Ma, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            PixelWorld: Towards Perceiving Everything as Pixels</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19339v1">http://arxiv.org/abs/2501.19339v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e. "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge the existing models' performance. Our findings show that (1) PEAP outperforms token-based baselines on multimodal datasets, benefiting from unified input for better disambiguation, (2) all models show significant declines in reasoning and coding capabilities when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities, (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation, (4) the attention pattern of PEAP is highly aligned with text token input, and (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent in pixel perception; however, there is still headroom for improvement. Our code and dataset will be released upon acceptance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning</title>
      <itunes:episode>470</itunes:episode>
      <podcast:episode>470</podcast:episode>
      <itunes:title>DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a763b470-c8b2-42e1-af1f-7993a3527d78</guid>
      <link>https://share.transistor.fm/s/fd283543</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04983v2">http://arxiv.org/abs/2411.04983v2</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04983v2">http://arxiv.org/abs/2411.04983v2</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.</p>
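
            <p><strong>Illustrative sketch:</strong><br>
            A Python sketch of the zero-shot planning loop the abstract implies: encode the current and goal observations into patch features, sample candidate action sequences, roll each one forward through the learned latent world model, and keep the sequence whose predicted features land closest to the goal features. The encode and world_model callables and the simple random-shooting planner are illustrative stand-ins; the paper's planner may differ.</p>

<pre><code>import numpy as np

def plan_actions(world_model, encode, obs, goal_obs, horizon=8, n_samples=256, act_dim=2):
    """Random-shooting planner over a latent world model: pick the action
    sequence whose predicted future features are closest to the goal features."""
    rng = np.random.default_rng(0)
    z, z_goal = encode(obs), encode(goal_obs)     # frozen visual encoder (e.g. DINOv2 patches)
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, act_dim))
    best_cost, best_seq = np.inf, candidates[0]
    for seq in candidates:
        z_t = z
        for a in seq:                             # roll the latent dynamics forward
            z_t = world_model(z_t, a)
        cost = float(np.linalg.norm(z_t - z_goal))
        if cost >= best_cost:
            continue
        best_cost, best_seq = cost, seq           # keep the lowest-cost sequence
    return best_seq
</code></pre>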
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:57:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fd283543/a1a2f176.mp3" length="19564349" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1219</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.RO, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04983v2">http://arxiv.org/abs/2411.04983v2</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming</title>
      <itunes:episode>469</itunes:episode>
      <podcast:episode>469</podcast:episode>
      <itunes:title>Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c06c741-dbd2-4922-9742-f2ea4210d278</guid>
      <link>https://share.transistor.fm/s/142ad36c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez</p>

            <p><strong>Title:</strong><br>
            Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18837v1">http://arxiv.org/abs/2501.18837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez</p>

            <p><strong>Title:</strong><br>
            Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18837v1">http://arxiv.org/abs/2501.18837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.</p>
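
            <p><strong>Illustrative sketch:</strong><br>
            A toy Python sketch of the training recipe the abstract outlines: prompt an LLM with a natural-language constitution to synthesize permitted and restricted examples, then fit a lightweight text classifier on them that can screen traffic. The llm callable, the prompts, and the scikit-learn classifier are illustrative assumptions; the production safeguards described in the paper are substantially more involved.</p>

<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_constitutional_classifier(llm, constitution, n_per_class=200):
    """Synthesize permitted/restricted examples from a constitution via an LLM,
    then fit a lightweight classifier that can be used to screen traffic."""
    allowed = [llm("Following these rules, write one clearly PERMITTED request:\n" + constitution)
               for _ in range(n_per_class)]
    blocked = [llm("Following these rules, write one clearly RESTRICTED request:\n" + constitution)
               for _ in range(n_per_class)]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(allowed + blocked, [0] * n_per_class + [1] * n_per_class)
    return clf  # clf.predict_proba(texts)[:, 1] gives a "restricted" score for gating
</code></pre>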
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:57:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/142ad36c/7dded904.mp3" length="20138654" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1255</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.CR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez</p>

            <p><strong>Title:</strong><br>
            Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18837v1">http://arxiv.org/abs/2501.18837v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scalable-Softmax Is Superior for Attention</title>
      <itunes:episode>468</itunes:episode>
      <podcast:episode>468</podcast:episode>
      <itunes:title>Scalable-Softmax Is Superior for Attention</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f9fb0e6f-27db-4157-a7be-476948354bfc</guid>
      <link>https://share.transistor.fm/s/ce98e11e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ken M. Nakanishi</p>

            <p><strong>Title:</strong><br>
            Scalable-Softmax Is Superior for Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19399v1">http://arxiv.org/abs/2501.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ken M. Nakanishi</p>

            <p><strong>Title:</strong><br>
            Scalable-Softmax Is Superior for Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19399v1">http://arxiv.org/abs/2501.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.</p>
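
            <p><strong>Illustrative sketch:</strong><br>
            A short Python sketch contrasting Softmax with a Scalable-Softmax-style variant. The abstract does not spell out the formula; the formulation below, which scales the logits by s * log(n) before the usual Softmax, matches our reading of the paper and should be treated as an assumption. It illustrates the abstract's point that Softmax's maximum output decays as the input length n grows, while the scaled variant does not flatten.</p>

<pre><code>import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=1.0):
    """Scale the logits by s * log(n) before Softmax so the maximum output
    does not decay toward zero as the input length n grows (assumed form)."""
    n = z.shape[-1]
    return softmax(s * np.log(n) * z)

for n in (16, 1024, 65536):
    z = np.zeros(n)
    z[0] = 2.0                                # one "key" logit among n distractors
    print(n, round(float(softmax(z).max()), 4), round(float(ssmax(z).max()), 4))
# Softmax's top probability shrinks as n grows; the scaled variant stays close to 1.
</code></pre>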
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:56:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ce98e11e/ca221a3a.mp3" length="22639239" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1411</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ken M. Nakanishi</p>

            <p><strong>Title:</strong><br>
            Scalable-Softmax Is Superior for Attention</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.19399v1">http://arxiv.org/abs/2501.19399v1</a></p>

            <p><strong>Abstract:</strong><br>
            The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training</title>
      <itunes:episode>467</itunes:episode>
      <podcast:episode>467</podcast:episode>
      <itunes:title>The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">79032eb3-ade4-4f5d-8b19-428ed77837be</guid>
      <link>https://share.transistor.fm/s/8e703679</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, math.OC, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach</p>

            <p><strong>Title:</strong><br>
            The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18965v1">http://arxiv.org/abs/2501.18965v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, math.OC, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach</p>

            <p><strong>Title:</strong><br>
            The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18965v1">http://arxiv.org/abs/2501.18965v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.</p>
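
            <p><strong>Illustrative sketch:</strong><br>
            A Python sketch of the schedule discussed above, a constant learning rate followed by a linear cooldown to zero. The base learning rate and cooldown fraction are illustrative values, not the paper's settings.</p>

<pre><code>def lr_constant_with_cooldown(step, total_steps, base_lr=3e-4, cooldown_frac=0.2):
    """Constant learning rate, then a linear cooldown to zero over the last
    cooldown_frac of training (illustrative values)."""
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step >= cooldown_start:
        return base_lr * (total_steps - step) / (total_steps - cooldown_start)
    return base_lr

# 10,000 steps: lr stays at 3e-4 for 8,000 steps, then decays linearly to zero.
print([round(lr_constant_with_cooldown(s, 10_000), 6) for s in (0, 7_999, 9_000, 9_999)])
</code></pre>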
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:56:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8e703679/0c11625e.mp3" length="21031421" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1311</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, math.OC, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, Francis Bach</p>

            <p><strong>Title:</strong><br>
            The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18965v1">http://arxiv.org/abs/2501.18965v1</a></p>

            <p><strong>Abstract:</strong><br>
            We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders</title>
      <itunes:episode>466</itunes:episode>
      <podcast:episode>466</podcast:episode>
      <itunes:title>SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">079bb102-78b1-48a4-b240-eaa838c17744</guid>
      <link>https://share.transistor.fm/s/b3aaf5d4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bartosz Cywiński, Kamil Deja</p>

            <p><strong>Title:</strong><br>
            SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18052v2">http://arxiv.org/abs/2501.18052v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bartosz Cywiński, Kamil Deja</p>

            <p><strong>Title:</strong><br>
            SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18052v2">http://arxiv.org/abs/2501.18052v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 03 Feb 2025 20:56:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b3aaf5d4/fad219f8.mp3" length="19400100" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1209</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bartosz Cywiński, Kamil Deja</p>

            <p><strong>Title:</strong><br>
            SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18052v2">http://arxiv.org/abs/2501.18052v2</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GuardReasoner: Towards Reasoning-based LLM Safeguards</title>
      <itunes:episode>465</itunes:episode>
      <podcast:episode>465</podcast:episode>
      <itunes:title>GuardReasoner: Towards Reasoning-based LLM Safeguards</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5a8906ea-1b76-4d22-9f6e-f265c3915841</guid>
      <link>https://share.transistor.fm/s/a9ecf40a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner: Towards Reasoning-based LLM Safeguards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18492v1">http://arxiv.org/abs/2501.18492v1</a></p>

            <p><strong>Abstract:</strong><br>
            As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner: Towards Reasoning-based LLM Safeguards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18492v1">http://arxiv.org/abs/2501.18492v1</a></p>

            <p><strong>Abstract:</strong><br>
            As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:39:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9ecf40a/d795a913.mp3" length="20215923" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CR, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi</p>

            <p><strong>Title:</strong><br>
            GuardReasoner: Towards Reasoning-based LLM Safeguards</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18492v1">http://arxiv.org/abs/2501.18492v1</a></p>

            <p><strong>Abstract:</strong><br>
            As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs</title>
      <itunes:episode>464</itunes:episode>
      <podcast:episode>464</podcast:episode>
      <itunes:title>Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4dd39bf9-b52f-4428-9869-db84c9ff94a8</guid>
      <link>https://share.transistor.fm/s/8ae628d0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18585v1">http://arxiv.org/abs/2501.18585v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18585v1">http://arxiv.org/abs/2501.18585v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:39:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8ae628d0/5133662c.mp3" length="22149000" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1381</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18585v1">http://arxiv.org/abs/2501.18585v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch</title>
      <itunes:episode>463</itunes:episode>
      <podcast:episode>463</podcast:episode>
      <itunes:title>Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05eea27d-c71f-4d63-86d6-2a7fb1eb87b9</guid>
      <link>https://share.transistor.fm/s/a47a2274</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham</p>

            <p><strong>Title:</strong><br>
            Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18512v1">http://arxiv.org/abs/2501.18512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing required bandwidth by two orders of magnitude.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham</p>

            <p><strong>Title:</strong><br>
            Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18512v1">http://arxiv.org/abs/2501.18512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing required bandwidth by two orders of magnitude.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:38:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a47a2274/cd04e1bb.mp3" length="22716601" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1416</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham</p>

            <p><strong>Title:</strong><br>
            Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18512v1">http://arxiv.org/abs/2501.18512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing required bandwidth by two orders of magnitude.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding</title>
      <itunes:episode>462</itunes:episode>
      <podcast:episode>462</podcast:episode>
      <itunes:title>MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a643998e-5305-4c61-bcee-fb07de55b46c</guid>
      <link>https://share.transistor.fm/s/5ed936b5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18362v1">http://arxiv.org/abs/2501.18362v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18362v1">http://arxiv.org/abs/2501.18362v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:38:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5ed936b5/2da1aaba.mp3" length="18711708" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1166</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18362v1">http://arxiv.org/abs/2501.18362v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Language Models Think Too Fast To Explore Effectively</title>
      <itunes:episode>461</itunes:episode>
      <podcast:episode>461</podcast:episode>
      <itunes:title>Large Language Models Think Too Fast To Explore Effectively</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">24e542f9-a5c8-4666-8238-7d703c0c80a5</guid>
      <link>https://share.transistor.fm/s/2646943f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Lan Pan, Hanbo Xie, Robert C. Wilson</p>

            <p><strong>Title:</strong><br>
            Large Language Models Think Too Fast To Explore Effectively</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18009v1">http://arxiv.org/abs/2501.18009v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have developed many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Lan Pan, Hanbo Xie, Robert C. Wilson</p>

            <p><strong>Title:</strong><br>
            Large Language Models Think Too Fast To Explore Effectively</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18009v1">http://arxiv.org/abs/2501.18009v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have developed many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:38:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2646943f/39c09106.mp3" length="24888295" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1552</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Lan Pan, Hanbo Xie, Robert C. Wilson</p>

            <p><strong>Title:</strong><br>
            Large Language Models Think Too Fast To Explore Effectively</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18009v1">http://arxiv.org/abs/2501.18009v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have developed many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training</title>
      <itunes:episode>460</itunes:episode>
      <podcast:episode>460</podcast:episode>
      <itunes:title>WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8d67e123-aabc-4942-9ade-7fef7b59468b</guid>
      <link>https://share.transistor.fm/s/d1a4a164</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Benjamin Feuer, Chinmay Hegde</p>

            <p><strong>Title:</strong><br>
            WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18511v1">http://arxiv.org/abs/2501.18511v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Benjamin Feuer, Chinmay Hegde</p>

            <p><strong>Title:</strong><br>
            WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18511v1">http://arxiv.org/abs/2501.18511v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:37:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d1a4a164/f1aba8e3.mp3" length="19499980" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1215</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Benjamin Feuer, Chinmay Hegde</p>

            <p><strong>Title:</strong><br>
            WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18511v1">http://arxiv.org/abs/2501.18511v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding</title>
      <itunes:episode>459</itunes:episode>
      <podcast:episode>459</podcast:episode>
      <itunes:title>PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d1a8f612-fbe8-455e-8846-3a07accb0bd3</guid>
      <link>https://share.transistor.fm/s/b2aa5728</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang</p>

            <p><strong>Title:</strong><br>
            PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16411v2">http://arxiv.org/abs/2501.16411v2</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang</p>

            <p><strong>Title:</strong><br>
            PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16411v2">http://arxiv.org/abs/2501.16411v2</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:37:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2aa5728/5ed7f3a6.mp3" length="23617315" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1472</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang</p>

            <p><strong>Title:</strong><br>
            PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16411v2">http://arxiv.org/abs/2501.16411v2</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>o3-mini vs DeepSeek-R1: Which One is Safer?</title>
      <itunes:episode>458</itunes:episode>
      <podcast:episode>458</podcast:episode>
      <itunes:title>o3-mini vs DeepSeek-R1: Which One is Safer?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">086eac39-3072-4f9b-aacd-ad33d9561036</guid>
      <link>https://share.transistor.fm/s/fce7ee6c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            o3-mini vs DeepSeek-R1: Which One is Safer?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18438v1">http://arxiv.org/abs/2501.18438v1</a></p>

            <p><strong>Abstract:</strong><br>
            The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe as compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            o3-mini vs DeepSeek-R1: Which One is Safer?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18438v1">http://arxiv.org/abs/2501.18438v1</a></p>

            <p><strong>Abstract:</strong><br>
            The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe as compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:37:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fce7ee6c/b6ef1f01.mp3" length="19282610" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1201</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            o3-mini vs DeepSeek-R1: Which One is Safer?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.18438v1">http://arxiv.org/abs/2501.18438v1</a></p>

            <p><strong>Abstract:</strong><br>
            The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI's o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70b version) and OpenAI's o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI's o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation</title>
      <itunes:episode>457</itunes:episode>
      <podcast:episode>457</podcast:episode>
      <itunes:title>CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ba8c6b28-44a9-4bd3-8f5a-d899f227f29b</guid>
      <link>https://share.transistor.fm/s/0a03882f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 1 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16609v1">http://arxiv.org/abs/2501.16609v1</a></p>

            <p><strong>Abstract:</strong><br>
            While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 1 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16609v1">http://arxiv.org/abs/2501.16609v1</a></p>

            <p><strong>Abstract:</strong><br>
            While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html</p>
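
            <p>The collaboration pattern described above, where the agent proposes each next step and the user may accept, override, or stop, can be pictured as a simple control loop. Below is a minimal, framework-agnostic sketch; the class and function names are illustrative assumptions, not CowPilot's actual API.</p>

            <pre><code># Illustrative sketch of a human-agent collaborative navigation loop.
# All classes and names here are assumptions for illustration,
# not CowPilot's actual implementation.
from dataclasses import dataclass

@dataclass
class Step:
    action: str          # e.g. "click", "type", "scroll"
    target: str          # e.g. a CSS selector or URL
    proposed_by: str     # "agent" or "human"

def collaborative_episode(agent_propose, human_review, execute, max_steps=30):
    """Run one task: the agent proposes steps; the human may accept,
    override with their own Step, or stop the episode."""
    history = []
    for _ in range(max_steps):
        proposal = agent_propose(history)           # agent suggests next step
        decision = human_review(proposal, history)  # "accept" | Step | "stop"
        if decision == "stop":
            break
        step = proposal if decision == "accept" else decision
        execute(step)
        history.append(step)
    agent_steps = sum(s.proposed_by == "agent" for s in history)
    human_share = 1 - agent_steps / max(len(history), 1)
    return history, human_share  # e.g. report the fraction of human-performed steps
</code></pre>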
            ]]>
      </content:encoded>
      <pubDate>Fri, 31 Jan 2025 20:36:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0a03882f/b66f897f.mp3" length="20444992" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1274</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 1 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16609v1">http://arxiv.org/abs/2501.16609v1</a></p>

            <p><strong>Abstract:</strong><br>
            While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate</title>
      <itunes:episode>456</itunes:episode>
      <podcast:episode>456</podcast:episode>
      <itunes:title>Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7c47e06b-ea22-4be4-ba0e-e3899d41b120</guid>
      <link>https://share.transistor.fm/s/86562d4a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yubo Wang, Xiang Yue, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17703v2">http://arxiv.org/abs/2501.17703v2</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT requires only 1 hour of training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, a DeepSeek-R1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy responses and the teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yubo Wang, Xiang Yue, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17703v2">http://arxiv.org/abs/2501.17703v2</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT requires only 1 hour of training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, a DeepSeek-R1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy responses and the teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.</p>
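
            <p>The critique-style supervision described above can be sketched as a simple data-construction step: each ([query; noisy response], critique) triple becomes a supervised example whose target is the critique rather than a reference answer. The prompt template and field names below are assumptions for illustration, not the paper's exact format.</p>

            <pre><code># Illustrative sketch: turn (query, noisy_response, critique) triples into
# supervised fine-tuning examples where the critique is the training target.
# The prompt template is an assumption, not the paper's exact format.

def build_cft_example(query: str, noisy_response: str, critique: str) -> dict:
    prompt = (
        "Question:\n" + query.strip() + "\n\n"
        "Candidate solution:\n" + noisy_response.strip() + "\n\n"
        "Critique the candidate solution, pointing out any errors:"
    )
    return {"input": prompt, "target": critique.strip()}

examples = [
    build_cft_example(
        query="What is 17 * 24?",
        noisy_response="17 * 24 = 398",
        critique="Incorrect: 17 * 24 = 408, since 17 * 20 = 340 and 17 * 4 = 68.",
    )
]
# `examples` can then be fed to any standard SFT pipeline, with the loss
# computed only on the critique (target) tokens.
</code></pre>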
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:12:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/86562d4a/4e8af537.mp3" length="21664601" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yubo Wang, Xiang Yue, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17703v2">http://arxiv.org/abs/2501.17703v2</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT requires only 1 hour of training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it can match the performance of SimpleRL, a DeepSeek-R1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy responses and the teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Atla Selene Mini: A General Purpose Evaluation Model</title>
      <itunes:episode>455</itunes:episode>
      <podcast:episode>455</podcast:episode>
      <itunes:title>Atla Selene Mini: A General Purpose Evaluation Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d50e0ac0-8c3b-44d4-8bb6-2aa89dae671b</guid>
      <link>https://share.transistor.fm/s/398c76b6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park</p>

            <p><strong>Title:</strong><br>
            Atla Selene Mini: A General Purpose Evaluation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17195v1">http://arxiv.org/abs/2501.17195v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B) and Ollama to encourage widespread community adoption.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park</p>

            <p><strong>Title:</strong><br>
            Atla Selene Mini: A General Purpose Evaluation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17195v1">http://arxiv.org/abs/2501.17195v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B) and Ollama to encourage widespread community adoption.</p>
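
            <p>The abstract mentions training on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss. Below is a minimal sketch of how the two objectives could be combined on a single preference pair; the mixing weight, beta value, and log-probability inputs are assumptions for illustration, not the released training code.</p>

            <pre><code># Illustrative sketch of a combined DPO + SFT objective on one preference pair.
# `logp_*` are assumed to be summed log-probabilities of the response tokens
# under the policy / frozen reference model; the weights are assumptions.
import math

def combined_dpo_sft_loss(
    logp_policy_chosen: float,
    logp_policy_rejected: float,
    logp_ref_chosen: float,
    logp_ref_rejected: float,
    beta: float = 0.1,
    sft_weight: float = 1.0,
) -> float:
    # DPO term: prefer the chosen response relative to the frozen reference model.
    margin = (logp_policy_chosen - logp_ref_chosen) - (
        logp_policy_rejected - logp_ref_rejected
    )
    dpo_loss = -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    # SFT term: standard negative log-likelihood on the chosen (good) response.
    sft_loss = -logp_policy_chosen
    return dpo_loss + sft_weight * sft_loss
</code></pre>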
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:12:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/398c76b6/62d13d42.mp3" length="24512960" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1528</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Andrei Alexandru, Antonia Calvi, Henry Broomfield, Jackson Golden, Kyle Dai, Mathias Leys, Maurice Burger, Max Bartolo, Roman Engeler, Sashank Pisupati, Toby Drane, Young Sun Park</p>

            <p><strong>Title:</strong><br>
            Atla Selene Mini: A General Purpose Evaluation Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17195v1">http://arxiv.org/abs/2501.17195v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks. It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations. We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios. Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format. Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena. We release the model weights on HuggingFace (https://hf.co/AtlaAI/Selene-1-Mini-Llama-3.1-8B) and Ollama to encourage widespread community adoption.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts</title>
      <itunes:episode>454</itunes:episode>
      <podcast:episode>454</podcast:episode>
      <itunes:title>Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">396194cc-1cb9-43c0-9978-54d634cab5d9</guid>
      <link>https://share.transistor.fm/s/bed42b1e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Clément Desroches, Martin Chauvin, Louis Ladan, Caroline Vateau, Simon Gosset, Philippe Cordier</p>

            <p><strong>Title:</strong><br>
            Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14334v2">http://arxiv.org/abs/2501.14334v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity of major providers hinders companies' ability to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high-adoption scenario, driven by widespread adoption of generative AI and agents together with increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Clément Desroches, Martin Chauvin, Louis Ladan, Caroline Vateau, Simon Gosset, Philippe Cordier</p>

            <p><strong>Title:</strong><br>
            Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14334v2">http://arxiv.org/abs/2501.14334v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity of major providers hinders companies' ability to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high-adoption scenario, driven by widespread adoption of generative AI and agents together with increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.</p>
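
            <p>To put the projected 24.4x rise in AI electricity use in perspective, a quick back-of-the-envelope calculation shows the compound annual growth rate it would imply; the baseline year (and hence the horizon length) is a placeholder assumption, not a figure from the paper.</p>

            <pre><code># Back-of-the-envelope: what compound annual growth rate corresponds to a
# 24.4x increase over a given horizon? The horizon lengths are assumptions.

def implied_cagr(total_factor: float, years: int) -> float:
    return total_factor ** (1.0 / years) - 1.0

for years in (5, 6):
    rate = implied_cagr(24.4, years)
    print(f"24.4x over {years} years is roughly {rate:.0%} growth per year")
# 24.4x over 5 years is roughly 89% growth per year
# 24.4x over 6 years is roughly 70% growth per year
</code></pre>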
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:11:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bed42b1e/4911d2a7.mp3" length="27561194" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1719</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.AI, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Clément Desroches, Martin Chauvin, Louis Ladan, Caroline Vateau, Simon Gosset, Philippe Cordier</p>

            <p><strong>Title:</strong><br>
            Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14334v2">http://arxiv.org/abs/2501.14334v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity of major providers hinders companies' ability to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high-adoption scenario, driven by widespread adoption of generative AI and agents together with increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation</title>
      <itunes:episode>453</itunes:episode>
      <podcast:episode>453</podcast:episode>
      <itunes:title>Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a70cbb7f-0f52-4a1e-8e4d-6666e4171c45</guid>
      <link>https://share.transistor.fm/s/cb243df5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17749v1">http://arxiv.org/abs/2501.17749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17749v1">http://arxiv.org/abs/2501.17749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.</p>
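
            <p>The workflow described above (automatically generating category-specific unsafe prompts, executing them against the model, and flagging suspicious responses for manual verification) can be pictured as a small test harness. The sketch below uses entirely hypothetical generator and classifier interfaces; it is not ASTRAL's actual API.</p>

            <pre><code># Illustrative sketch of an automated safety-testing harness.
# `generate_unsafe_prompts`, `query_model`, and `classify_response` are
# hypothetical stand-ins, not ASTRAL's actual interfaces.

def run_safety_suite(categories, prompts_per_category,
                     generate_unsafe_prompts, query_model, classify_response):
    """Return responses flagged as potentially unsafe for manual review."""
    flagged = []
    for category in categories:
        for prompt in generate_unsafe_prompts(category, prompts_per_category):
            response = query_model(prompt)
            verdict = classify_response(prompt, response)  # "safe" | "unsafe" | "unclear"
            if verdict != "safe":
                flagged.append({"category": category,
                                "prompt": prompt,
                                "response": response,
                                "verdict": verdict})
    return flagged  # candidates for the manual verification step
</code></pre>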
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:11:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cb243df5/2cc8db22.mp3" length="21232023" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1323</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.SE, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura</p>

            <p><strong>Title:</strong><br>
            Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17749v1">http://arxiv.org/abs/2501.17749v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks</title>
      <itunes:episode>452</itunes:episode>
      <podcast:episode>452</podcast:episode>
      <itunes:title>Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7df7d74d-9e21-417d-9f52-5aece0efbad4</guid>
      <link>https://share.transistor.fm/s/434eef68</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, Jiaming Liu</p>

            <p><strong>Title:</strong><br>
            Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15891v1">http://arxiv.org/abs/2501.15891v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation. https://logn-2024.github.io/Any2anyTryonProjectPage/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, Jiaming Liu</p>

            <p><strong>Title:</strong><br>
            Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15891v1">http://arxiv.org/abs/2501.15891v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation. https://logn-2024.github.io/Any2anyTryonProjectPage/</p>
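
            <p>Handling inputs of different sizes with a single set of learned position embeddings is often done by interpolating the embedding grid to each input's patch grid. The sketch below shows that general trick only; it is not Any2AnyTryon's adaptive position embedding implementation.</p>

            <pre><code># Illustrative sketch: resize a learned 2D position-embedding grid so a
# vision transformer can accept inputs of varying resolution. This shows the
# general interpolation trick, not Any2AnyTryon's specific method.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """pos_embed: (1, H*W, C) learned embeddings for a square H x W patch grid."""
    _, n, c = pos_embed.shape
    old = int(n ** 0.5)                                            # assume square grid
    grid = pos_embed.reshape(1, old, old, c).permute(0, 3, 1, 2)   # (1, C, H, W)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], c)

# Example: a 16x16 grid of 768-dim embeddings resized for a 24x18 patch grid.
pe = torch.randn(1, 16 * 16, 768)
print(resize_pos_embed(pe, (24, 18)).shape)  # torch.Size([1, 432, 768])
</code></pre>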
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:11:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/434eef68/123ebd22.mp3" length="21410905" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1334</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hailong Guo, Bohan Zeng, Yiren Song, Wentao Zhang, Chuang Zhang, Jiaming Liu</p>

            <p><strong>Title:</strong><br>
            Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15891v1">http://arxiv.org/abs/2501.15891v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image-based virtual try-on (VTON) aims to generate a virtual try-on result by transferring an input garment onto a target person's image. However, the scarcity of paired garment-model data makes it challenging for existing methods to achieve high generalization and quality in VTON. Also, it limits the ability to generate mask-free try-ons. To tackle the data scarcity problem, approaches such as Stable Garment and MMTryon use a synthetic data strategy, effectively increasing the amount of paired data on the model side. However, existing methods are typically limited to performing specific try-on tasks and lack user-friendliness. To enhance the generalization and controllability of VTON generation, we propose Any2AnyTryon, which can generate try-on results based on different textual instructions and model garment images to meet various needs, eliminating the reliance on masks, poses, or other conditions. Specifically, we first construct the virtual try-on dataset LAION-Garment, the largest known open-source garment try-on dataset. Then, we introduce adaptive position embedding, which enables the model to generate satisfactory outfitted model images or garment images based on input images of different sizes and categories, significantly enhancing the generalization and controllability of VTON generation. In our experiments, we demonstrate the effectiveness of our Any2AnyTryon and compare it with existing methods. The results show that Any2AnyTryon enables flexible, controllable, and high-quality image-based virtual try-on generation. https://logn-2024.github.io/Any2anyTryonProjectPage/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation</title>
      <itunes:episode>451</itunes:episode>
      <podcast:episode>451</podcast:episode>
      <itunes:title>Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">369a03cd-a44f-4af6-86af-27637c2c4b81</guid>
      <link>https://share.transistor.fm/s/81085b31</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu</p>

            <p><strong>Title:</strong><br>
            Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17433v1">http://arxiv.org/abs/2501.17433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is that <strong>it is reckless to consider guardrail moderation as a way of clutching at straws against harmful fine-tuning attacks</strong>, as it cannot solve the inherent safety issues of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu</p>

            <p><strong>Title:</strong><br>
            Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17433v1">http://arxiv.org/abs/2501.17433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is that <strong>it is reckless to consider guardrail moderation as a way of clutching at straws against harmful fine-tuning attacks</strong>, as it cannot solve the inherent safety issues of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:10:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81085b31/1d173a4b.mp3" length="20922311" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1304</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CR, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu</p>

            <p><strong>Title:</strong><br>
            Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17433v1">http://arxiv.org/abs/2501.17433v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is that <strong>it is reckless to consider guardrail moderation as a way of clutching at straws against harmful fine-tuning attacks</strong>, as it cannot solve the inherent safety issues of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text</title>
      <itunes:episode>450</itunes:episode>
      <podcast:episode>450</podcast:episode>
      <itunes:title>People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a5248b8c-0d33-467b-a1af-be37279b68b7</guid>
      <link>https://share.transistor.fm/s/5dd11909</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenna Russell, Marzena Karpinska, Mohit Iyyer</p>

            <p><strong>Title:</strong><br>
            People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15654v1">http://arxiv.org/abs/2501.15654v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenna Russell, Marzena Karpinska, Mohit Iyyer</p>

            <p><strong>Title:</strong><br>
            People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15654v1">http://arxiv.org/abs/2501.15654v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 30 Jan 2025 20:10:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5dd11909/81deef47.mp3" length="18963770" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1182</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jenna Russell, Marzena Karpinska, Mohit Iyyer</p>

            <p><strong>Title:</strong><br>
            People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15654v1">http://arxiv.org/abs/2501.15654v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such "expert" annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts' free-form explanations shows that while they rely heavily on specific lexical clues ('AI vocabulary'), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training</title>
      <itunes:episode>449</itunes:episode>
      <podcast:episode>449</podcast:episode>
      <itunes:title>SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3867218b-3a2f-45bc-b28a-f38a12ee82f9</guid>
      <link>https://share.transistor.fm/s/dfaf1ffc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma</p>

            <p><strong>Title:</strong><br>
            SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17161v1">http://arxiv.org/abs/2501.17161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma</p>

            <p><strong>Title:</strong><br>
            SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17161v1">http://arxiv.org/abs/2501.17161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:30:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dfaf1ffc/4840feb8.mp3" length="22412329" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma</p>

            <p><strong>Title:</strong><br>
            SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17161v1">http://arxiv.org/abs/2501.17161v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Optimizing Large Language Model Training Using FP4 Quantization</title>
      <itunes:episode>448</itunes:episode>
      <podcast:episode>448</podcast:episode>
      <itunes:title>Optimizing Large Language Model Training Using FP4 Quantization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d0442731-46d3-45d1-9637-ecaa18e1ad74</guid>
      <link>https://share.transistor.fm/s/6cee5873</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng</p>

            <p><strong>Title:</strong><br>
            Optimizing Large Language Model Training Using FP4 Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17116v1">http://arxiv.org/abs/2501.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng</p>

            <p><strong>Title:</strong><br>
            Optimizing Large Language Model Training Using FP4 Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17116v1">http://arxiv.org/abs/2501.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:29:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6cee5873/16cdc374.mp3" length="21320599" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng</p>

            <p><strong>Title:</strong><br>
            Optimizing Large Language Model Training Using FP4 Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17116v1">http://arxiv.org/abs/2501.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation</title>
      <itunes:episode>447</itunes:episode>
      <podcast:episode>447</podcast:episode>
      <itunes:title>DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ea175110-72d8-466b-ab0e-645eae47f771</guid>
      <link>https://share.transistor.fm/s/587bfbc9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu</p>

            <p><strong>Title:</strong><br>
            DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16764v1">http://arxiv.org/abs/2501.16764v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptations of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu</p>

            <p><strong>Title:</strong><br>
            DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16764v1">http://arxiv.org/abs/2501.16764v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptations of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:29:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/587bfbc9/963937a4.mp3" length="22144418" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1380</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu</p>

            <p><strong>Title:</strong><br>
            DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16764v1">http://arxiv.org/abs/2501.16764v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptations of numerous techniques for image generation to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling</title>
      <itunes:episode>446</itunes:episode>
      <podcast:episode>446</podcast:episode>
      <itunes:title>Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">116a2bd7-f1c6-4d3c-bdde-15112b3bb5dd</guid>
      <link>https://share.transistor.fm/s/132a64b3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16975v1">http://arxiv.org/abs/2501.16975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16975v1">http://arxiv.org/abs/2501.16975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:28:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/132a64b3/5bcd0633.mp3" length="22499664" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16975v1">http://arxiv.org/abs/2501.16975v1</a></p>

            <p><strong>Abstract:</strong><br>
            Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Open Problems in Mechanistic Interpretability</title>
      <itunes:episode>445</itunes:episode>
      <podcast:episode>445</podcast:episode>
      <itunes:title>Open Problems in Mechanistic Interpretability</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a8897500-85c8-4764-970c-68a0bec6b4ef</guid>
      <link>https://share.transistor.fm/s/d0bc102e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath</p>

            <p><strong>Title:</strong><br>
            Open Problems in Mechanistic Interpretability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16496v1">http://arxiv.org/abs/2501.16496v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath</p>

            <p><strong>Title:</strong><br>
            Open Problems in Mechanistic Interpretability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16496v1">http://arxiv.org/abs/2501.16496v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:28:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d0bc102e/d4c183ad.mp3" length="24819735" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1548</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath</p>

            <p><strong>Title:</strong><br>
            Open Problems in Mechanistic Interpretability</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16496v1">http://arxiv.org/abs/2501.16496v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Low-Rank Adapters Meet Neural Architecture Search for LLM Compression</title>
      <itunes:episode>444</itunes:episode>
      <podcast:episode>444</podcast:episode>
      <itunes:title>Low-Rank Adapters Meet Neural Architecture Search for LLM Compression</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6e0ce42-b940-4f8d-8dd9-278d59cff5f5</guid>
      <link>https://share.transistor.fm/s/b5aaa035</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain</p>

            <p><strong>Title:</strong><br>
            Low-Rank Adapters Meet Neural Architecture Search for LLM Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16372v1">http://arxiv.org/abs/2501.16372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain</p>

            <p><strong>Title:</strong><br>
            Low-Rank Adapters Meet Neural Architecture Search for LLM Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16372v1">http://arxiv.org/abs/2501.16372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:28:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b5aaa035/beaaa8c6.mp3" length="21599384" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1346</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain</p>

            <p><strong>Title:</strong><br>
            Low-Rank Adapters Meet Neural Architecture Search for LLM Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16372v1">http://arxiv.org/abs/2501.16372v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding</title>
      <itunes:episode>443</itunes:episode>
      <podcast:episode>443</podcast:episode>
      <itunes:title>IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">46c36a98-fcc5-4e02-b6d7-f7f00dad4f3e</guid>
      <link>https://share.transistor.fm/s/58e5c7fa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri</p>

            <p><strong>Title:</strong><br>
            IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15747v2">http://arxiv.org/abs/2501.15747v2</a></p>

            <p><strong>Abstract:</strong><br>
            Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmark's design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri</p>

            <p><strong>Title:</strong><br>
            IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15747v2">http://arxiv.org/abs/2501.15747v2</a></p>

            <p><strong>Abstract:</strong><br>
            Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmark's design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:27:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/58e5c7fa/0ef86503.mp3" length="19286421" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1202</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri</p>

            <p><strong>Title:</strong><br>
            IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15747v2">http://arxiv.org/abs/2501.15747v2</a></p>

            <p><strong>Abstract:</strong><br>
            Known by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmark's design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Histoires Morales: A French Dataset for Assessing Moral Alignment</title>
      <itunes:episode>442</itunes:episode>
      <podcast:episode>442</podcast:episode>
      <itunes:title>Histoires Morales: A French Dataset for Assessing Moral Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">53c2707e-2244-42f0-9eb8-0323ddea4f62</guid>
      <link>https://share.transistor.fm/s/9def176c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Thibaud Leteno, Irina Proskurina, Antoine Gourru, Julien Velcin, Charlotte Laclau, Guillaume Metzler, Christophe Gravier</p>

            <p><strong>Title:</strong><br>
            Histoires Morales: A French Dataset for Assessing Moral Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17117v1">http://arxiv.org/abs/2501.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning language models with human values is crucial, especially as they become more integrated into everyday life. While models are often adapted to user preferences, it is equally important to ensure they align with moral norms and behaviours in real-world social situations. Despite significant progress in languages like English and Chinese, French has seen little attention in this area, leaving a gap in understanding how LLMs handle moral reasoning in this language. To address this gap, we introduce Histoires Morales, a French dataset derived from Moral Stories, created through translation and subsequently refined with the assistance of native speakers to guarantee grammatical accuracy and adaptation to the French cultural context. We also rely on annotations of the moral values within the dataset to ensure their alignment with French norms. Histoires Morales covers a wide range of social situations, including differences in tipping practices, expressions of honesty in relationships, and responsibilities toward animals. To foster future research, we also conduct preliminary experiments on the alignment of multilingual models on French and English data and the robustness of the alignment. We find that while LLMs are generally aligned with human moral norms by default, they can be easily influenced with user-preference optimization for both moral and immoral data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Thibaud Leteno, Irina Proskurina, Antoine Gourru, Julien Velcin, Charlotte Laclau, Guillaume Metzler, Christophe Gravier</p>

            <p><strong>Title:</strong><br>
            Histoires Morales: A French Dataset for Assessing Moral Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17117v1">http://arxiv.org/abs/2501.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning language models with human values is crucial, especially as they become more integrated into everyday life. While models are often adapted to user preferences, it is equally important to ensure they align with moral norms and behaviours in real-world social situations. Despite significant progress in languages like English and Chinese, French has seen little attention in this area, leaving a gap in understanding how LLMs handle moral reasoning in this language. To address this gap, we introduce Histoires Morales, a French dataset derived from Moral Stories, created through translation and subsequently refined with the assistance of native speakers to guarantee grammatical accuracy and adaptation to the French cultural context. We also rely on annotations of the moral values within the dataset to ensure their alignment with French norms. Histoires Morales covers a wide range of social situations, including differences in tipping practices, expressions of honesty in relationships, and responsibilities toward animals. To foster future research, we also conduct preliminary experiments on the alignment of multilingual models on French and English data and the robustness of the alignment. We find that while LLMs are generally aligned with human moral norms by default, they can be easily influenced with user-preference optimization for both moral and immoral data.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 29 Jan 2025 20:27:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9def176c/c4e9c824.mp3" length="19956800" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Thibaud Leteno, Irina Proskurina, Antoine Gourru, Julien Velcin, Charlotte Laclau, Guillaume Metzler, Christophe Gravier</p>

            <p><strong>Title:</strong><br>
            Histoires Morales: A French Dataset for Assessing Moral Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.17117v1">http://arxiv.org/abs/2501.17117v1</a></p>

            <p><strong>Abstract:</strong><br>
            Aligning language models with human values is crucial, especially as they become more integrated into everyday life. While models are often adapted to user preferences, it is equally important to ensure they align with moral norms and behaviours in real-world social situations. Despite significant progress in languages like English and Chinese, French has seen little attention in this area, leaving a gap in understanding how LLMs handle moral reasoning in this language. To address this gap, we introduce Histoires Morales, a French dataset derived from Moral Stories, created through translation and subsequently refined with the assistance of native speakers to guarantee grammatical accuracy and adaptation to the French cultural context. We also rely on annotations of the moral values within the dataset to ensure their alignment with French norms. Histoires Morales covers a wide range of social situations, including differences in tipping practices, expressions of honesty in relationships, and responsibilities toward animals. To foster future research, we also conduct preliminary experiments on the alignment of multilingual models on French and English data and the robustness of the alignment. We find that while LLMs are generally aligned with human moral norms by default, they can be easily influenced with user-preference optimization for both moral and immoral data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen2.5-1M Technical Report</title>
      <itunes:episode>441</itunes:episode>
      <podcast:episode>441</podcast:episode>
      <itunes:title>Qwen2.5-1M Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">01d90f96-5bf4-4c3c-a338-e0b52afc0471</guid>
      <link>https://share.transistor.fm/s/fb96bebf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-1M Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15383v1">http://arxiv.org/abs/2501.15383v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-1M Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15383v1">http://arxiv.org/abs/2501.15383v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:57:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb96bebf/625b11f7.mp3" length="23377340" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1457</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang</p>

            <p><strong>Title:</strong><br>
            Qwen2.5-1M Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15383v1">http://arxiv.org/abs/2501.15383v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios, and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that the Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer</title>
      <itunes:episode>440</itunes:episode>
      <podcast:episode>440</podcast:episode>
      <itunes:title>ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b9e00c6-0c86-46f2-9d8f-2817885d8256</guid>
      <link>https://share.transistor.fm/s/c239e34b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao</p>

            <p><strong>Title:</strong><br>
            ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15570v1">http://arxiv.org/abs/2501.15570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce a series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at <a href="https://github.com/yynil/RWKVInside">https://github.com/yynil/RWKVInside</a> and <a href="https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1">https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao</p>

            <p><strong>Title:</strong><br>
            ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15570v1">http://arxiv.org/abs/2501.15570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce a series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at <a href="https://github.com/yynil/RWKVInside">https://github.com/yynil/RWKVInside</a> and <a href="https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1">https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1</a>.</p>
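
            <p>The abstract emphasizes that the distillation recipe can transfer knowledge from any larger LLM to a smaller student with relatively few tokens. As a minimal, generic sketch of logit-level knowledge distillation (not ARWKV's actual training code; the logits below are toy arrays), the student is pushed toward the teacher's temperature-softened output distribution:</p>

            <pre><code>import numpy as np

def softmax(z, temperature=2.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, the
    standard knowledge-distillation objective (illustrative only, not the paper's)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))) * temperature ** 2)

# Toy example: the loss is positive and shrinks as the student matches the teacher.
print(distill_loss([4.0, 1.0, 0.5], [2.0, 2.0, 2.0]))
</code></pre>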
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:57:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c239e34b/fd16113e.mp3" length="19981491" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1245</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao</p>

            <p><strong>Title:</strong><br>
            ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15570v1">http://arxiv.org/abs/2501.15570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce a series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is ongoing work that will be updated continuously. The model checkpoints and source code are available at <a href="https://github.com/yynil/RWKVInside">https://github.com/yynil/RWKVInside</a> and <a href="https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1">https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards General-Purpose Model-Free Reinforcement Learning</title>
      <itunes:episode>439</itunes:episode>
      <podcast:episode>439</podcast:episode>
      <itunes:title>Towards General-Purpose Model-Free Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">63232e68-6e93-4081-9567-0aacb2850653</guid>
      <link>https://share.transistor.fm/s/1322b74f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat</p>

            <p><strong>Title:</strong><br>
            Towards General-Purpose Model-Free Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16142v1">http://arxiv.org/abs/2501.16142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat</p>

            <p><strong>Title:</strong><br>
            Towards General-Purpose Model-Free Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16142v1">http://arxiv.org/abs/2501.16142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.</p>
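
            <p>The central idea in the abstract is to learn representations under which the value function becomes approximately linear. As a rough numerical sketch (not the MR.Q algorithm; the embedding below is a made-up stand-in for a learned encoder), "linearized" means the action value is an inner product between a state-action embedding and a weight vector:</p>

            <pre><code>import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))   # fixed random projection standing in for a learned encoder
w = rng.standard_normal(8)        # linear value weights

def phi(state, action):
    """Hypothetical embedding of a state-action pair; in the paper this would be
    learned with model-based objectives, here it is only a toy stand-in."""
    return np.tanh(W @ np.concatenate([state, action]))

# The value function is approximately linear in the embedding: Q(s, a) = w . phi(s, a)
state, action = np.array([0.1, -0.3, 0.7]), np.array([1.0, 0.0])
print(float(w @ phi(state, action)))
</code></pre>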
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:56:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1322b74f/0a64de02.mp3" length="20108929" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1253</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat</p>

            <p><strong>Title:</strong><br>
            Towards General-Purpose Model-Free Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.16142v1">http://arxiv.org/abs/2501.16142v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation</title>
      <itunes:episode>438</itunes:episode>
      <podcast:episode>438</podcast:episode>
      <itunes:title>Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e662fd18-a135-4420-b52b-6f467eaef6fe</guid>
      <link>https://share.transistor.fm/s/91d09090</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu</p>

            <p><strong>Title:</strong><br>
            Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15907v1">http://arxiv.org/abs/2501.15907v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu</p>

            <p><strong>Title:</strong><br>
            Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15907v1">http://arxiv.org/abs/2501.15907v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:56:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/91d09090/31988fad.mp3" length="21207358" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1322</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu</p>

            <p><strong>Title:</strong><br>
            Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15907v1">http://arxiv.org/abs/2501.15907v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>iFormer: Integrating ConvNet and Transformer for Mobile Application</title>
      <itunes:episode>437</itunes:episode>
      <podcast:episode>437</podcast:episode>
      <itunes:title>iFormer: Integrating ConvNet and Transformer for Mobile Application</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5dc17166-c325-473f-a755-7333ccd4ea27</guid>
      <link>https://share.transistor.fm/s/b9547b35</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuanyang Zheng</p>

            <p><strong>Title:</strong><br>
            iFormer: Integrating ConvNet and Transformer for Mobile Application</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15369v1">http://arxiv.org/abs/2501.15369v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuanyang Zheng</p>

            <p><strong>Title:</strong><br>
            iFormer: Integrating ConvNet and Transformer for Mobile Application</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15369v1">http://arxiv.org/abs/2501.15369v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:56:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9547b35/bd774ea2.mp3" length="23111976" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Chuanyang Zheng</p>

            <p><strong>Title:</strong><br>
            iFormer: Integrating ConvNet and Transformer for Mobile Application</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.15369v1">http://arxiv.org/abs/2501.15369v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are Vision Language Models Texture or Shape Biased and Can We Steer Them?</title>
      <itunes:episode>436</itunes:episode>
      <podcast:episode>436</podcast:episode>
      <itunes:title>Are Vision Language Models Texture or Shape Biased and Can We Steer Them?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d38bcb92-bf23-4ca4-b21c-f7190d9c5b76</guid>
      <link>https://share.transistor.fm/s/447f931b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.LG, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper</p>

            <p><strong>Title:</strong><br>
            Are Vision Language Models Texture or Shape Biased and Can We Steer Them?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.09193v1">http://arxiv.org/abs/2403.09193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.LG, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper</p>

            <p><strong>Title:</strong><br>
            Are Vision Language Models Texture or Shape Biased and Can We Steer Them?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.09193v1">http://arxiv.org/abs/2403.09193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.</p>
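
            <p>The steering numbers above (shape bias moved from 49% to 72%, versus roughly 96% for humans) are shape-bias scores. The abstract does not spell out the metric; as an assumption for illustration, the snippet below uses the common cue-conflict definition from the texture-vs-shape literature: among responses that match either the shape label or the texture label of a cue-conflict image, the fraction matching the shape label.</p>

            <pre><code>def shape_bias(decisions):
    """decisions: a list of "shape", "texture", or "other" judgments on cue-conflict images.
    Returns the fraction of shape-consistent decisions among shape- or texture-consistent ones.
    (Common definition in the texture/shape-bias literature; assumed here, not quoted from the paper.)"""
    shape = sum(1 for d in decisions if d == "shape")
    texture = sum(1 for d in decisions if d == "texture")
    return shape / (shape + texture)

# Toy example: 72 shape-consistent vs 28 texture-consistent answers gives a 72% shape bias.
print(shape_bias(["shape"] * 72 + ["texture"] * 28))  # 0.72
</code></pre>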
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:55:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/447f931b/b4d1c30e.mp3" length="24047375" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1499</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.LG, q-bio.NC</p>

            <p><strong>Authors:</strong><br>
            Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper</p>

            <p><strong>Title:</strong><br>
            Are Vision Language Models Texture or Shape Biased and Can We Steer Them?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.09193v1">http://arxiv.org/abs/2403.09193v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CodeMonkeys: Scaling Test-Time Compute for Software Engineering</title>
      <itunes:episode>435</itunes:episode>
      <podcast:episode>435</podcast:episode>
      <itunes:title>CodeMonkeys: Scaling Test-Time Compute for Software Engineering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">31c851f2-31b7-4bb2-8d30-8ebf9f338b31</guid>
      <link>https://share.transistor.fm/s/975c681f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini</p>

            <p><strong>Title:</strong><br>
            CodeMonkeys: Scaling Test-Time Compute for Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14723v1">http://arxiv.org/abs/2501.14723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini</p>

            <p><strong>Title:</strong><br>
            CodeMonkeys: Scaling Test-Time Compute for Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14723v1">http://arxiv.org/abs/2501.14723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.</p>
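
            <p>The abstract distinguishes "serial" test-time compute (more iterations per trajectory) from "parallel" test-time compute (more trajectories per issue), and notes that the up-front context-identification cost is amortized across the parallel samples. A small arithmetic sketch (the numbers below are made up, not figures from the paper) shows why the per-candidate share of that up-front cost shrinks as the number of trajectories grows:</p>

            <pre><code># Hypothetical cost model for scaling test-time compute (illustrative numbers only).
def per_candidate_cost(context_cost, cost_per_iteration, iterations, trajectories):
    """Up-front context cost is paid once per issue and amortized over all parallel
    trajectories; the serial cost grows with the number of iterations per trajectory."""
    serial_cost = cost_per_iteration * iterations
    amortized_context = context_cost / trajectories
    return serial_cost + amortized_context

# Doubling the trajectories halves the amortized context share per candidate edit:
print(per_candidate_cost(context_cost=10.0, cost_per_iteration=0.5, iterations=8, trajectories=5))
print(per_candidate_cost(context_cost=10.0, cost_per_iteration=0.5, iterations=8, trajectories=10))
</code></pre>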
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:55:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/975c681f/1fbc896c.mp3" length="22199567" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1384</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini</p>

            <p><strong>Title:</strong><br>
            CodeMonkeys: Scaling Test-Time Compute for Software Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14723v1">http://arxiv.org/abs/2501.14723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models</title>
      <itunes:episode>434</itunes:episode>
      <podcast:episode>434</podcast:episode>
      <itunes:title>Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be4dc17b-c5bf-4818-800c-0d847f8749d7</guid>
      <link>https://share.transistor.fm/s/6780a1c5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak</p>

            <p><strong>Title:</strong><br>
            Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12370v2">http://arxiv.org/abs/2501.12370v2</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak</p>

            <p><strong>Title:</strong><br>
            Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12370v2">http://arxiv.org/abs/2501.12370v2</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.</p>
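
            <p>The abstract defines the sparsity level as the fraction of parameters that are inactive for a given example. As a worked example (the expert counts below are hypothetical, not configurations from the paper), total parameters and per-example compute can be varied independently by changing how many experts exist versus how many are routed to:</p>

            <pre><code># Toy MoE parameter/compute accounting (illustrative only).
def moe_stats(total_experts, active_experts, params_per_expert, dense_params):
    total = dense_params + total_experts * params_per_expert
    active = dense_params + active_experts * params_per_expert   # drives FLOPs per token
    sparsity = 1.0 - active / total                              # fraction of inactive parameters
    return total, active, sparsity

# 64 experts with 2 active per token: large total capacity, small per-example compute.
total, active, sparsity = moe_stats(total_experts=64, active_experts=2,
                                    params_per_expert=100e6, dense_params=1e9)
print(f"total={total/1e9:.1f}B  active={active/1e9:.1f}B  sparsity={sparsity:.2%}")
</code></pre>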
            ]]>
      </content:encoded>
      <pubDate>Tue, 28 Jan 2025 20:54:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6780a1c5/d9b391c0.mp3" length="20559525" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1281</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak</p>

            <p><strong>Title:</strong><br>
            Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12370v2">http://arxiv.org/abs/2501.12370v2</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, impacts model performance during pretraining and downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Humanity's Last Exam</title>
      <itunes:episode>433</itunes:episode>
      <podcast:episode>433</podcast:episode>
      <itunes:title>Humanity's Last Exam</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a84a4c1f-acd3-43d5-ba5e-e58cf4a135c7</guid>
      <link>https://share.transistor.fm/s/67a4e477</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpala, Zachary Giboney, Gashaw M. Goshu, Joan of Arc Xavier, Sarah-Jane Crowson, Mohinder Maheshbhai Naiya, Noah Burns, Lennart Finke, Zerui Cheng, Hyunwoo Park, Francesco Fournier-Facio, John Wydallis, Mark Nandor, Ankit Singh, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Darling Duclosel, Jungbae Nam, Jennifer Zampese, Ryan G. Hoerr, Aras Bacho, Gautier Abou Loume, Abdallah Galal, Hangrui Cao, Alexis C Garretson, Damien Sileo, Qiuyu Ren, Doru Cojoc, Pavel Arkhipov, Usman Qazi, Lianghui Li, Sumeet Motwani, Christian Schroeder de Witt, Edwin Taylor, Johannes Veith, Eric Singer, Taylor D. Hartman, Paolo Rissone, Jaehyeok Jin, Jack Wei Lun Shi, Chris G. Willcocks, Joshua Robinson, Aleksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze-An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy, Dakotah Martinez, Ben Pageler, Nick Crispino, Dimitri Zvonkine, Natanael Wildner Fraga, Saeed Soori, Ori Press, Henry Tang, Julian Salazar, Sean R. Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, T. Ryan Rogers, Wenjin Zhang, Bikun Li, Jinzhou Yang, Arun Rao, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Subrata Mishra, Ariel Ghislain Kemogne Kamdoum, Tobias Kreiman, Tad Hogg, Alvin Jin, Carlo Bosio, Gongbo Sun, Brian P Coppola, Tim Tarver, Haline Heidinger, Rafael Sayous, Stefan Ivanov, Joseph M Cavanagh, Jiawei Shen, Joseph Marvin Imperial, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Ali Dehghan, Andres Algaba, Brecht Verbeken, David Noever, Ragavendran P V, Lisa Schut, Ilia Sucholutsky, Evgenii Zheltonozhskii, Derek Lim, Richard Stanley, Shankar Sivarajan, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Jennifer Sandlin, Anmol Sahu, Yuzheng Hu, Sara Fish, Nasser Heydari, Archimedes Apronti, Kaivalya Rawal, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Jeremy Nguyen, Daniil S. 
Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Alan Goldfarb, Sergey Ivanov, Rafał Poświata, Chenguang Wang, Daofeng Li, Donato Crisostomi, Andrea Achilleos, Benjamin Myklebust, Archan Sen, David Perrella, Nurdin Kaparov, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias Magoulas, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Veit Elser, Kanu Priya Agarwal, Victor Efren Guadarrama Vilchis, Immo Klose, Christoph Demian, Ujjwala Anantheswaran, Adam Zweiger, Guglielmo Albani, Jeffery Li, Nicolas Daans, Maksim Radionov, Václav Rozhoň, Ziqiao Ma, Christian Stump, Mohammed Berkani, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Marco Piccardo, Ferenc Jeanplong, Niv Cohen, Josef Tkadlec, Paul Rosu, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Aline Menezes, Arkil Patel, Zixuan Wang, Jamie Tucker-Foltz, Jack Stade, Tom Goertzen, Fereshteh Kazemi, Jeremiah Milbauer, John Arnold Ambay, Abhishek Shukla, Yan Carlos Leyva Labrador, Alan Givré, Hew Wolff, Vivien Rossbach, Muhammad Fayez Aziz, Younesse Kaddar, Yanxu Chen, Robin Zhang, Jiayi Pan, Antonio Terpin, Niklas Muennighoff, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Adam Jones, Jainam Shah, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Andrew Ho, Shaul Barkan, Jiaqi Wang, Martin Stehberger, Egor Kretov, Kaustubh Sridhar, Zienab EL-Wasif, Anji Zhang, Daniel Pyda, Joanna Tam, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, Daniel Bugas, David Aldous, Jesyin Lai, Shannon Coleman, Mohsen Bahaloo, Jiangnan Xu, Sangwon Lee, Sandy Zhao, Ning Tang, Michael K. Cohen, Micah Carroll, Orr Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong, Allison Tee, Jérémy Andréoletti, Orion Weller, Raghav Singhal, Gang Zhang, Alexander Ivanov, Seri Khoury, Hamid Mostaghimi, Kunvar Thaman, Qijia Chen, Tran Quoc Khánh, Jacob Loader, Stefano Cavalleri, Hannah Szlyk, Zachary Brown, Jonathan Roberts, William Alley, Kunyang Sun, Ryan Stendall, Max Lamparth, Anka Reuel, Ting Wang, Hanmeng Xu, Sreenivas Goud Raparthi, Pablo Hernández-Cámara, Freddie Martin, Dmitry Malishev, Thomas Preu, Tomek Korbak, Marcus Abramovitch, Dominic Williamson, Ziye Chen, Biró Bálint, M Saiful Bari, Peyman Kassani, Zihao Wang, Behzad Ansarinejad, Laxman Prasad Goswami, Yewen Sun, Hossam Elgnainy, Daniel Tordera, George Balabanian, Earth Anderson, Lynna Kvistad, Alejandro José Moyano, Rajat Maheshwari, Ahmad Sakor, Murat Eron, Isaac C. McAlister, Javier Gimenez, Innocent Enyekwe, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Ronald Clark, Sherwin Abdoli, Tim Santens, Khalida Meer, Harrison K Wang, Kalyan Ramakrishnan, Evan Chen, Alessandro Tomasiello, G. 
Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Niels Mündler, Avi Semler, Emma Rodman, Jacob Drori, Carl J Fossum, Milind Jagota, Ronak Pradeep, Honglu Fan, Tej Shah, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Carter Harris, Jason Gross, Ilya Gusev, Asankhaya Sharma, Shashank Agnihotri, Pavel Zhelnov, Siranut Usawasutsakorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Đuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Hossein Shahrtash, Edson Oliveira, Joseph W. Jackson, Daniel Espinosa Gonzalez, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Emilien Duc, Bita Golshani, David Stap, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Lukas Lewark, Mátyás Vincze, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Jiang Muzhen, Fredrik Ekström, Angela Hammon, Oam Patel, Nicolas Remy, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Claire Sparrow, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Mike Battaglia, Mohammad Maghsoudimehrabani, Hieu Hoang, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Stephen Mensah, Nathan Andre, Anton Peristyy, Chris Harjadi, Himanshu Gupta, Stephen Malina, Samuel Albanie, Will Cai, Mustafa Mehkary, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Jasdeep Sidhu, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Brian Weber, Harsh Kumar, Tong Jiang, Arunim Agarwal, Chiara Ceconello, Warren S. Vaz, Chao Zhuang, Haon Park, Andrew R. Tawfeek, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham...</p>]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpala, Zachary Giboney, Gashaw M. Goshu, Joan of Arc Xavier, Sarah-Jane Crowson, Mohinder Maheshbhai Naiya, Noah Burns, Lennart Finke, Zerui Cheng, Hyunwoo Park, Francesco Fournier-Facio, John Wydallis, Mark Nandor, Ankit Singh, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Darling Duclosel, Jungbae Nam, Jennifer Zampese, Ryan G. Hoerr, Aras Bacho, Gautier Abou Loume, Abdallah Galal, Hangrui Cao, Alexis C Garretson, Damien Sileo, Qiuyu Ren, Doru Cojoc, Pavel Arkhipov, Usman Qazi, Lianghui Li, Sumeet Motwani, Christian Schroeder de Witt, Edwin Taylor, Johannes Veith, Eric Singer, Taylor D. Hartman, Paolo Rissone, Jaehyeok Jin, Jack Wei Lun Shi, Chris G. Willcocks, Joshua Robinson, Aleksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze-An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy, Dakotah Martinez, Ben Pageler, Nick Crispino, Dimitri Zvonkine, Natanael Wildner Fraga, Saeed Soori, Ori Press, Henry Tang, Julian Salazar, Sean R. Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, T. Ryan Rogers, Wenjin Zhang, Bikun Li, Jinzhou Yang, Arun Rao, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Subrata Mishra, Ariel Ghislain Kemogne Kamdoum, Tobias Kreiman, Tad Hogg, Alvin Jin, Carlo Bosio, Gongbo Sun, Brian P Coppola, Tim Tarver, Haline Heidinger, Rafael Sayous, Stefan Ivanov, Joseph M Cavanagh, Jiawei Shen, Joseph Marvin Imperial, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Ali Dehghan, Andres Algaba, Brecht Verbeken, David Noever, Ragavendran P V, Lisa Schut, Ilia Sucholutsky, Evgenii Zheltonozhskii, Derek Lim, Richard Stanley, Shankar Sivarajan, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Jennifer Sandlin, Anmol Sahu, Yuzheng Hu, Sara Fish, Nasser Heydari, Archimedes Apronti, Kaivalya Rawal, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Jeremy Nguyen, Daniil S. 
Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Alan Goldfarb, Sergey Ivanov, Rafał Poświata, Chenguang Wang, Daofeng Li, Donato Crisostomi, Andrea Achilleos, Benjamin Myklebust, Archan Sen, David Perrella, Nurdin Kaparov, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias Magoulas, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Veit Elser, Kanu Priya Agarwal, Victor Efren Guadarrama Vilchis, Immo Klose, Christoph Demian, Ujjwala Anantheswaran, Adam Zweiger, Guglielmo Albani, Jeffery Li, Nicolas Daans, Maksim Radionov, Václav Rozhoň, Ziqiao Ma, Christian Stump, Mohammed Berkani, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Marco Piccardo, Ferenc Jeanplong, Niv Cohen, Josef Tkadlec, Paul Rosu, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Aline Menezes, Arkil Patel, Zixuan Wang, Jamie Tucker-Foltz, Jack Stade, Tom Goertzen, Fereshteh Kazemi, Jeremiah Milbauer, John Arnold Ambay, Abhishek Shukla, Yan Carlos Leyva Labrador, Alan Givré, Hew Wolff, Vivien Rossbach, Muhammad Fayez Aziz, Younesse Kaddar, Yanxu Chen, Robin Zhang, Jiayi Pan, Antonio Terpin, Niklas Muennighoff, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Adam Jones, Jainam Shah, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Andrew Ho, Shaul Barkan, Jiaqi Wang, Martin Stehberger, Egor Kretov, Kaustubh Sridhar, Zienab EL-Wasif, Anji Zhang, Daniel Pyda, Joanna Tam, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, Daniel Bugas, David Aldous, Jesyin Lai, Shannon Coleman, Mohsen Bahaloo, Jiangnan Xu, Sangwon Lee, Sandy Zhao, Ning Tang, Michael K. Cohen, Micah Carroll, Orr Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong, Allison Tee, Jérémy Andréoletti, Orion Weller, Raghav Singhal, Gang Zhang, Alexander Ivanov, Seri Khoury, Hamid Mostaghimi, Kunvar Thaman, Qijia Chen, Tran Quoc Khánh, Jacob Loader, Stefano Cavalleri, Hannah Szlyk, Zachary Brown, Jonathan Roberts, William Alley, Kunyang Sun, Ryan Stendall, Max Lamparth, Anka Reuel, Ting Wang, Hanmeng Xu, Sreenivas Goud Raparthi, Pablo Hernández-Cámara, Freddie Martin, Dmitry Malishev, Thomas Preu, Tomek Korbak, Marcus Abramovitch, Dominic Williamson, Ziye Chen, Biró Bálint, M Saiful Bari, Peyman Kassani, Zihao Wang, Behzad Ansarinejad, Laxman Prasad Goswami, Yewen Sun, Hossam Elgnainy, Daniel Tordera, George Balabanian, Earth Anderson, Lynna Kvistad, Alejandro José Moyano, Rajat Maheshwari, Ahmad Sakor, Murat Eron, Isaac C. McAlister, Javier Gimenez, Innocent Enyekwe, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Ronald Clark, Sherwin Abdoli, Tim Santens, Khalida Meer, Harrison K Wang, Kalyan Ramakrishnan, Evan Chen, Alessandro Tomasiello, G. 
Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Niels Mündler, Avi Semler, Emma Rodman, Jacob Drori, Carl J Fossum, Milind Jagota, Ronak Pradeep, Honglu Fan, Tej Shah, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Carter Harris, Jason Gross, Ilya Gusev, Asankhaya Sharma, Shashank Agnihotri, Pavel Zhelnov, Siranut Usawasutsakorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Đuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Hossein Shahrtash, Edson Oliveira, Joseph W. Jackson, Daniel Espinosa Gonzalez, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Emilien Duc, Bita Golshani, David Stap, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Lukas Lewark, Mátyás Vincze, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Jiang Muzhen, Fredrik Ekström, Angela Hammon, Oam Patel, Nicolas Remy, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Claire Sparrow, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Mike Battaglia, Mohammad Maghsoudimehrabani, Hieu Hoang, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Stephen Mensah, Nathan Andre, Anton Peristyy, Chris Harjadi, Himanshu Gupta, Stephen Malina, Samuel Albanie, Will Cai, Mustafa Mehkary, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Jasdeep Sidhu, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Brian Weber, Harsh Kumar, Tong Jiang, Arunim Agarwal, Chiara Ceconello, Warren S. Vaz, Chao Zhuang, Haon Park, Andrew R. Tawfeek, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham...</p>]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 21:00:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/67a4e477/4403ea9f.mp3" length="21988037" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1371</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpala, Zachary Giboney, Gashaw M. Goshu, Joan of Arc Xavier, Sarah-Jane Crowson, Mohinder Maheshbhai Naiya, Noah Burns, Lennart Finke, Zerui Cheng, Hyunwoo Park, Francesco Fournier-Facio, John Wydallis, Mark Nandor, Ankit Singh, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Darling Duclosel, Jungbae Nam, Jennifer Zampese, Ryan G. Hoerr, Aras Bacho, Gautier Abou Loume, Abdallah Galal, Hangrui Cao, Alexis C Garretson, Damien Sileo, Qiuyu Ren, Doru Cojoc, Pavel Arkhipov, Usman Qazi, Lianghui Li, Sumeet Motwani, Christian Schroeder de Witt, Edwin Taylor, Johannes Veith, Eric Singer, Taylor D. Hartman, Paolo Rissone, Jaehyeok Jin, Jack Wei Lun Shi, Chris G. Willcocks, Joshua Robinson, Aleksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze-An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy, Dakotah Martinez, Ben Pageler, Nick Crispino, Dimitri Zvonkine, Natanael Wildner Fraga, Saeed Soori, Ori Press, Henry Tang, Julian Salazar, Sean R. Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, T. Ryan Rogers, Wenjin Zhang, Bikun Li, Jinzhou Yang, Arun Rao, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Subrata Mishra, Ariel Ghislain Kemogne Kamdoum, Tobias Kreiman, Tad Hogg, Alvin Jin, Carlo Bosio, Gongbo Sun, Brian P Coppola, Tim Tarver, Haline Heidinger, Rafael Sayous, Stefan Ivanov, Joseph M Cavanagh, Jiawei Shen, Joseph Marvin Imperial, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Ali Dehghan, Andres Algaba, Brecht Verbeken, David Noever, Ragavendran P V, Lisa Schut, Ilia Sucholutsky, Evgenii Zheltonozhskii, Derek Lim, Richard Stanley, Shankar Sivarajan, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Jennifer Sandlin, Anmol Sahu, Yuzheng Hu, Sara Fish, Nasser Heydari, Archimedes Apronti, Kaivalya Rawal, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Jeremy Nguyen, Daniil S. 
Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Alan Goldfarb, Sergey Ivanov, Rafał Poświata, Chenguang Wang, Daofeng Li, Donato Crisostomi, Andrea Achilleos, Benjamin Myklebust, Archan Sen, David Perrella, Nurdin Kaparov, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias Magoulas, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Veit Elser, Kanu Priya Agarwal, Victor Efren Guadarrama Vilchis, Immo Klose, Christoph Demian, Ujjwala Anantheswaran, Adam Zweiger, Guglielmo Albani, Jeffery Li, Nicolas Daans, Maksim Radionov, Václav Rozhoň, Ziqiao Ma, Christian Stump, Mohammed Berkani, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Marco Piccardo, Ferenc Jeanplong, Niv Cohen, Josef Tkadlec, Paul Rosu, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Aline Menezes, Arkil Patel, Zixuan Wang, Jamie Tucker-Foltz, Jack Stade, Tom Goertzen, Fereshteh Kazemi, Jeremiah Milbauer, John Arnold Ambay, Abhishek Shukla, Yan Carlos Leyva Labrador, Alan Givré, Hew Wolff, Vivien Rossbach, Muhammad Fayez Aziz, Younesse Kaddar, Yanxu Chen, Robin Zhang, Jiayi Pan, Antonio Terpin, Niklas Muennighoff, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Adam Jones, Jainam Shah, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Andrew Ho, Shaul Barkan, Jiaqi Wang, Martin Stehberger, Egor Kretov, Kaustubh Sridhar, Zienab EL-Wasif, Anji Zhang, Daniel Pyda, Joanna Tam, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, Daniel Bugas, David Aldous, Jesyin Lai, Shannon Coleman, Mohsen Bahaloo, Jiangnan Xu, Sangwon Lee, Sandy Zhao, Ning Tang, Michael K. Cohen, Micah Carroll, Orr Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong, Allison Tee, Jérémy Andréoletti, Orion Weller, Raghav Singhal, Gang Zhang, Alexander Ivanov, Seri Khoury, Hamid Mostaghimi, Kunvar Thaman, Qijia Chen, Tran Quoc Khánh, Jacob Loader, Stefano Cavalleri, Hannah Szlyk, Zachary Brown, Jonathan Roberts, William Alley, Kunyang Sun, Ryan Stendall, Max Lamparth, Anka Reuel, Ting Wang, Hanmeng Xu, Sreenivas Goud Raparthi, Pablo Hernández-Cámara, Freddie Martin, Dmitry Malishev, Thomas Preu, Tomek Korbak, Marcus Abramovitch, Dominic Williamson, Ziye Chen, Biró Bálint, M Saiful Bari, Peyman Kassani, Zihao Wang, Behzad Ansarinejad, Laxman Prasad Goswami, Yewen Sun, Hossam Elgnainy, Daniel Tordera, George Balabanian, Earth Anderson, Lynna Kvistad, Alejandro José Moyano, Rajat Maheshwari, Ahmad Sakor, Murat Eron, Isaac C. McAlister, Javier Gimenez, Innocent Enyekwe, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Ronald Clark, Sherwin Abdoli, Tim Santens, Khalida Meer, Harrison K Wang, Kalyan Ramakrishnan, Evan Chen, Alessandro Tomasiello, G. 
Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Niels Mündler, Avi Semler, Emma Rodman, Jacob Drori, Carl J Fossum, Milind Jagota, Ronak Pradeep, Honglu Fan, Tej Shah, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Carter Harris, Jason Gross, Ilya Gusev, Asankhaya Sharma, Shashank Agnihotri, Pavel Zhelnov, Siranut Usawasutsakorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Đuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Hossein Shahrtash, Edson Oliveira, Joseph W. Jackson, Daniel Espinosa Gonzalez, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Emilien Duc, Bita Golshani, David Stap, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Lukas Lewark, Mátyás Vincze, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Jiang Muzhen, Fredrik Ekström, Angela Hammon, Oam Patel, Nicolas Remy, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Claire Sparrow, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Mike Battaglia, Mohammad Maghsoudimehrabani, Hieu Hoang, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Stephen Mensah, Nathan Andre, Anton Peristyy, Chris Harjadi, Himanshu Gupta, Stephen Malina, Samuel Albanie, Will Cai, Mustafa Mehkary, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Jasdeep Sidhu, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Brian Weber, Harsh Kumar, Tong Jiang, Arunim Agarwal, Chiara Ceconello, Warren S. Vaz, Chao Zhuang, Haon Park, Andrew R. Tawfeek, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham...</p>]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chain-of-Retrieval Augmented Generation</title>
      <itunes:episode>432</itunes:episode>
      <podcast:episode>432</podcast:episode>
      <itunes:title>Chain-of-Retrieval Augmented Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c0facc8c-457b-45ea-b6a5-45d85cee1430</guid>
      <link>https://share.transistor.fm/s/0fe9b0d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Chain-of-Retrieval Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14342v1">http://arxiv.org/abs/2501.14342v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.</p>
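
            <p><strong>Illustrative Sketch:</strong><br>
            A minimal Python sketch of the retrieve-reformulate-answer loop the abstract describes, not the authors' released code; generate_subquery, retrieve, and generate_answer are hypothetical caller-supplied callables standing in for the trained model and retriever, and max_steps plays the role of the test-time chain-length control.</p>

            <pre><code># Sketch of a chain-of-retrieval loop in the spirit of CoRAG (illustrative only).
def chain_of_retrieval(question, generate_subquery, retrieve, generate_answer,
                       max_steps=6):
    """generate_subquery(question, chain) -> sub-query or None,
    retrieve(sub_query) -> documents,
    generate_answer(question, chain) -> final answer.
    All three are placeholders for the underlying LLM and retriever."""
    chain = []  # accumulated (sub_query, retrieved_docs) steps
    for _ in range(max_steps):
        sub_query = generate_subquery(question, chain)
        if sub_query is None:  # the model judges the evidence sufficient
            break
        chain.append((sub_query, retrieve(sub_query)))
    return generate_answer(question, chain)</code></pre>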
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Chain-of-Retrieval Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14342v1">http://arxiv.org/abs/2501.14342v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 21:00:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0fe9b0d7/d8df4949.mp3" length="22507161" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Chain-of-Retrieval Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14342v1">http://arxiv.org/abs/2501.14342v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Redundancy Principles for MLLMs Benchmarks</title>
      <itunes:episode>431</itunes:episode>
      <podcast:episode>431</podcast:episode>
      <itunes:title>Redundancy Principles for MLLMs Benchmarks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">225d7715-9d82-4437-989b-5bf8eb108f13</guid>
      <link>https://share.transistor.fm/s/8451509f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai</p>

            <p><strong>Title:</strong><br>
            Redundancy Principles for MLLMs Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13953v1">http://arxiv.org/abs/2501.13953v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.</p>
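
            <p><strong>Illustrative Sketch:</strong><br>
            One plausible way to quantify cross-benchmark redundancy, sketched below for illustration only (the paper's exact metric may differ), is the rank correlation between the model rankings two benchmarks induce: if two benchmarks order the same set of MLLMs almost identically, one of them adds little information.</p>

            <pre><code># Illustrative proxy for cross-benchmark redundancy: Spearman rank correlation
# between per-model scores on two benchmarks (the paper's metric may differ).
from scipy.stats import spearmanr

def benchmark_redundancy(scores_a, scores_b):
    """scores_a, scores_b: scores of the same models, in the same order,
    on two benchmarks. Values near 1 suggest the benchmarks are redundant."""
    rho, _ = spearmanr(scores_a, scores_b)
    return rho

# Toy example with made-up scores for four models.
print(benchmark_redundancy([70.1, 65.3, 58.2, 80.0], [68.5, 66.0, 55.1, 79.2]))</code></pre>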
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai</p>

            <p><strong>Title:</strong><br>
            Redundancy Principles for MLLMs Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13953v1">http://arxiv.org/abs/2501.13953v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:59:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8451509f/d1712bb0.mp3" length="21501972" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai</p>

            <p><strong>Title:</strong><br>
            Redundancy Principles for MLLMs Benchmarks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13953v1">http://arxiv.org/abs/2501.13953v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques</title>
      <itunes:episode>430</itunes:episode>
      <podcast:episode>430</podcast:episode>
      <itunes:title>RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">162d0b69-0258-427f-8cac-08dbc9dae5d7</guid>
      <link>https://share.transistor.fm/s/96008568</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14492v1">http://arxiv.org/abs/2501.14492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at <a href="https://github.com/tangzhy/RealCritic">https://github.com/tangzhy/RealCritic</a>.</p>
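
            <p><strong>Illustrative Sketch:</strong><br>
            A minimal sketch of the closed-loop idea described above, for illustration only (not the released benchmark code): instead of grading the critique text directly, a critique is scored by whether the correction it induces actually fixes the answer. solve, critique, revise, and is_correct are hypothetical placeholders.</p>

            <pre><code># Closed-loop critique evaluation, sketched: a critique counts only if the
# revision it produces is correct. All callables are hypothetical placeholders.
def closed_loop_critique_score(problems, solve, critique, revise, is_correct):
    """solve(p) -> answer, critique(p, answer) -> critique text,
    revise(p, answer, critique) -> revised answer, is_correct(p, answer) -> bool."""
    fixed = 0
    for problem in problems:
        answer = solve(problem)
        revised = revise(problem, answer, critique(problem, answer))
        fixed += int(is_correct(problem, revised))
    return fixed / max(len(problems), 1)</code></pre>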
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14492v1">http://arxiv.org/abs/2501.14492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at <a href="https://github.com/tangzhy/RealCritic">https://github.com/tangzhy/RealCritic</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:59:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/96008568/2c27ffe3.mp3" length="22703642" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14492v1">http://arxiv.org/abs/2501.14492v1</a></p>

            <p><strong>Abstract:</strong><br>
            Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at <a href="https://github.com/tangzhy/RealCritic">https://github.com/tangzhy/RealCritic</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RL + Transformer = A General-Purpose Problem Solver</title>
      <itunes:episode>429</itunes:episode>
      <podcast:episode>429</podcast:episode>
      <itunes:title>RL + Transformer = A General-Purpose Problem Solver</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a6d6c02-5ec7-4fc4-9e59-2679fff9f893</guid>
      <link>https://share.transistor.fm/s/b585c7fe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Micah Rentschler, Jesse Roberts</p>

            <p><strong>Title:</strong><br>
            RL + Transformer = A General-Purpose Problem Solver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14176v1">http://arxiv.org/abs/2501.14176v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before - an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.</p>
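
            <p><strong>Illustrative Sketch:</strong><br>
            A schematic sketch of the in-context reinforcement learning setup the abstract describes, for illustration only (not the authors' implementation): the fine-tuned transformer acts as a policy conditioned on the full cross-episode history, so any improvement across episodes comes from the growing context rather than from weight updates. env_reset, env_step, and policy are hypothetical placeholders.</p>

            <pre><code># In-context RL interface, sketched: the frozen sequence model conditions on
# the whole (state, action, reward) history, which persists across episodes.
def run_icrl(env_reset, env_step, policy, n_episodes=5, max_steps=50):
    """env_reset() -> state; env_step(action) -> (next_state, reward, done);
    policy(history, state) -> action. All are hypothetical placeholders."""
    history, returns = [], []
    for _ in range(n_episodes):
        state, total = env_reset(), 0.0
        for _ in range(max_steps):
            action = policy(history, state)
            state_next, reward, done = env_step(action)
            history.append((state, action, reward))  # context keeps growing
            total += reward
            state = state_next
            if done:
                break
        returns.append(total)
    return returns  # should trend upward across episodes if ICRL is working</code></pre>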
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Micah Rentschler, Jesse Roberts</p>

            <p><strong>Title:</strong><br>
            RL + Transformer = A General-Purpose Problem Solver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14176v1">http://arxiv.org/abs/2501.14176v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before - an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:59:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b585c7fe/cdcfeb98.mp3" length="23481854" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Micah Rentschler, Jesse Roberts</p>

            <p><strong>Title:</strong><br>
            RL + Transformer = A General-Purpose Problem Solver</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14176v1">http://arxiv.org/abs/2501.14176v1</a></p>

            <p><strong>Abstract:</strong><br>
            What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before - an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Relightable Full-Body Gaussian Codec Avatars</title>
      <itunes:episode>428</itunes:episode>
      <podcast:episode>428</podcast:episode>
      <itunes:title>Relightable Full-Body Gaussian Codec Avatars</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4ca2b17-5210-4d5a-be6a-6847acf7a027</guid>
      <link>https://share.transistor.fm/s/c2ebe843</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito</p>

            <p><strong>Title:</strong><br>
            Relightable Full-Body Gaussian Codec Avatars</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14726v1">http://arxiv.org/abs/2501.14726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.</p>
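
            <p><strong>Illustrative Sketch:</strong><br>
            The efficiency argument for zonal harmonics mentioned above can be made concrete with a small sketch (illustrative only; band normalization constants are assumed folded into the coefficients, and conventions vary): a zonal-harmonic lobe is rotationally symmetric about an axis, so rotating the lobe only requires rotating that axis vector rather than applying per-band rotation matrices as general spherical harmonics would.</p>

            <pre><code># Why zonal harmonics rotate cheaply (illustrative): a ZH expansion about an
# axis d evaluates as sum_l z_l * P_l(dot(d, w)); "rotating" it just means
# rotating the axis d. Normalization constants are assumed folded into z_l.
import numpy as np
from numpy.polynomial.legendre import legval

def eval_zonal_harmonics(zh_coeffs, axis, direction):
    """zh_coeffs: coefficients z_l; axis, direction: unit 3-vectors."""
    return legval(np.dot(axis, direction), zh_coeffs)</code></pre>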
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito</p>

            <p><strong>Title:</strong><br>
            Relightable Full-Body Gaussian Codec Avatars</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14726v1">http://arxiv.org/abs/2501.14726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:58:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c2ebe843/852eae15.mp3" length="20105155" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1253</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Shaofei Wang, Tomas Simon, Igor Santesteban, Timur Bagautdinov, Junxuan Li, Vasu Agrawal, Fabian Prada, Shoou-I Yu, Pace Nalbone, Matt Gramlich, Roman Lubachersky, Chenglei Wu, Javier Romero, Jason Saragih, Michael Zollhoefer, Andreas Geiger, Siyu Tang, Shunsuke Saito</p>

            <p><strong>Title:</strong><br>
            Relightable Full-Body Gaussian Codec Avatars</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.14726v1">http://arxiv.org/abs/2501.14726v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Question Answering on Patient Medical Records with Private Fine-Tuned LLMs</title>
      <itunes:episode>427</itunes:episode>
      <podcast:episode>427</podcast:episode>
      <itunes:title>Question Answering on Patient Medical Records with Private Fine-Tuned LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6a8b9e93-b6ef-472f-9cd2-8de3901fb075</guid>
      <link>https://share.transistor.fm/s/2a4fe9fb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sara Kothari, Ayush Gupta</p>

            <p><strong>Title:</strong><br>
            Question Answering on Patient Medical Records with Private Fine-Tuned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13687v1">http://arxiv.org/abs/2501.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and by 42% in Meteor score on Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: <a href="https://huggingface.co/genloop">https://huggingface.co/genloop</a></p>
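
            <p><strong>Illustrative Sketch:</strong><br>
            A minimal sketch of the two-stage pipeline the abstract outlines, for illustration only (not the released models): Task1 selects the FHIR resources relevant to the question and Task2 answers from only that subset; select_relevant and answer_llm are hypothetical placeholders for the fine-tuned models.</p>

            <pre><code># Two-stage QA over FHIR resources, sketched with hypothetical placeholders.
def answer_over_ehr(question, fhir_resources, select_relevant, answer_llm, top_k=5):
    """fhir_resources: list of FHIR resources (e.g. JSON strings).
    select_relevant(question, resources, top_k) -> relevant subset  (Task1)
    answer_llm(question, subset) -> answer text                     (Task2)"""
    relevant = select_relevant(question, fhir_resources, top_k)
    return answer_llm(question, relevant)</code></pre>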
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sara Kothari, Ayush Gupta</p>

            <p><strong>Title:</strong><br>
            Question Answering on Patient Medical Records with Private Fine-Tuned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13687v1">http://arxiv.org/abs/2501.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and by 42% in Meteor score on Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: <a href="https://huggingface.co/genloop">https://huggingface.co/genloop</a></p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:58:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2a4fe9fb/c5023211.mp3" length="21125005" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Sara Kothari, Ayush Gupta</p>

            <p><strong>Title:</strong><br>
            Question Answering on Patient Medical Records with Private Fine-Tuned LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13687v1">http://arxiv.org/abs/2501.13687v1</a></p>

            <p><strong>Abstract:</strong><br>
            Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and by 42% in Meteor score on Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: <a href="https://huggingface.co/genloop">https://huggingface.co/genloop</a></p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing</title>
      <itunes:episode>426</itunes:episode>
      <podcast:episode>426</podcast:episode>
      <itunes:title>GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">51526a10-e407-4899-be65-52d836595578</guid>
      <link>https://share.transistor.fm/s/fae4bd28</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13925v1">http://arxiv.org/abs/2501.13925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13925v1">http://arxiv.org/abs/2501.13925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:58:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fae4bd28/aea62e8b.mp3" length="22566956" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1407</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S. Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13925v1">http://arxiv.org/abs/2501.13925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an imperative factor of visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge in region-level comprehension. Moreover, the development of the grounding conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel - the first end-to-end high resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, ideal for high-precision RS image analysis. To support the grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset GeoPixelD through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation</title>
      <itunes:episode>425</itunes:episode>
      <podcast:episode>425</podcast:episode>
      <itunes:title>AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">47a0da67-f17a-4c15-9b75-f9efd5e46093</guid>
      <link>https://share.transistor.fm/s/16bdad92</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan</p>

            <p><strong>Title:</strong><br>
            AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.14614v1">http://arxiv.org/abs/2403.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan</p>

            <p><strong>Title:</strong><br>
            AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.14614v1">http://arxiv.org/abs/2403.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:57:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16bdad92/e27f7d5b.mp3" length="20299124" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1265</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan</p>

            <p><strong>Title:</strong><br>
            AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2403.14614v1">http://arxiv.org/abs/2403.14614v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently, all-in-one algorithms have garnered significant attention by addressing different types of degradations within a single model without requiring prior information of the input degradation type. However, these methods purely operate in the spatial domain and do not delve into the distinct frequency variations inherent to different degradation types. To address this gap, we propose an adaptive all-in-one image restoration network based on frequency mining and modulation. Our approach is motivated by the observation that different degradation types impact the image content on different frequency subbands, thereby requiring different treatments for each restoration task. Specifically, we first mine low- and high-frequency information from the input features, guided by the adaptively decoupled spectra of the degraded image. The extracted features are then modulated by a bidirectional operator to facilitate interactions between different frequency components. Finally, the modulated features are merged into the original input for a progressively guided restoration. With this approach, the model achieves adaptive reconstruction by accentuating the informative frequency subbands according to different input degradations. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on different image restoration tasks, including denoising, dehazing, deraining, motion deblurring, and low-light image enhancement. Our code is available at https://github.com/c-yn/AdaIR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning</title>
      <itunes:episode>424</itunes:episode>
      <podcast:episode>424</podcast:episode>
      <itunes:title>Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">710aa208-0444-44c9-9bbf-9efc608e2121</guid>
      <link>https://share.transistor.fm/s/c1559231</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas</p>

            <p><strong>Title:</strong><br>
            Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.19458v1">http://arxiv.org/abs/2411.19458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas</p>

            <p><strong>Title:</strong><br>
            Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.19458v1">http://arxiv.org/abs/2411.19458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 27 Jan 2025 20:57:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c1559231/88a7ed13.mp3" length="23397470" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1459</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas</p>

            <p><strong>Title:</strong><br>
            Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.19458v1">http://arxiv.org/abs/2411.19458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SRMT: Shared Memory for Multi-agent Lifelong Pathfinding</title>
      <itunes:episode>423</itunes:episode>
      <podcast:episode>423</podcast:episode>
      <itunes:title>SRMT: Shared Memory for Multi-agent Lifelong Pathfinding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3f6a0db-3501-4ab5-aeda-33eb70ba1789</guid>
      <link>https://share.transistor.fm/s/8c6120d7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI, cs.MA, I.2.11</p>

            <p><strong>Authors:</strong><br>
            Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            SRMT: Shared Memory for Multi-agent Lifelong Pathfinding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13200v1">http://arxiv.org/abs/2501.13200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI, cs.MA, I.2.11</p>

            <p><strong>Authors:</strong><br>
            Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            SRMT: Shared Memory for Multi-agent Lifelong Pathfinding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13200v1">http://arxiv.org/abs/2501.13200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:43:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8c6120d7/3a7a2596.mp3" length="22941020" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1430</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.LG, cs.AI, cs.MA, I.2.11</p>

            <p><strong>Authors:</strong><br>
            Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev</p>

            <p><strong>Title:</strong><br>
            SRMT: Shared Memory for Multi-agent Lifelong Pathfinding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13200v1">http://arxiv.org/abs/2501.13200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models</title>
      <itunes:episode>422</itunes:episode>
      <podcast:episode>422</podcast:episode>
      <itunes:title>Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">701e9aba-3454-4423-a7ce-c643d6eb5cfe</guid>
      <link>https://share.transistor.fm/s/f304b1ad</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang</p>

            <p><strong>Title:</strong><br>
            Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13629v1">http://arxiv.org/abs/2501.13629v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B tokens of system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang</p>

            <p><strong>Title:</strong><br>
            Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13629v1">http://arxiv.org/abs/2501.13629v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B tokens of system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:42:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f304b1ad/2472093f.mp3" length="19959744" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang</p>

            <p><strong>Title:</strong><br>
            Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13629v1">http://arxiv.org/abs/2501.13629v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B tokens of system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Improving Video Generation with Human Feedback</title>
      <itunes:episode>421</itunes:episode>
      <podcast:episode>421</podcast:episode>
      <itunes:title>Improving Video Generation with Human Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">983109ef-8a63-45be-96b9-311ae87f135e</guid>
      <link>https://share.transistor.fm/s/c570541e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Improving Video Generation with Human Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13918v1">http://arxiv.org/abs/2501.13918v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Improving Video Generation with Human Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13918v1">http://arxiv.org/abs/2501.13918v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:42:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c570541e/066ce60b.mp3" length="23425842" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang</p>

            <p><strong>Title:</strong><br>
            Improving Video Generation with Human Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13918v1">http://arxiv.org/abs/2501.13918v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Temporal Preference Optimization for Long-Form Video Understanding</title>
      <itunes:episode>420</itunes:episode>
      <podcast:episode>420</podcast:episode>
      <itunes:title>Temporal Preference Optimization for Long-Form Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7d7b1d4-ba9b-4742-ae4c-6821cdcd4efa</guid>
      <link>https://share.transistor.fm/s/e414dc40</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Temporal Preference Optimization for Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13919v1">http://arxiv.org/abs/2501.13919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Temporal Preference Optimization for Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13919v1">http://arxiv.org/abs/2501.13919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:42:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e414dc40/402506fa.mp3" length="23851345" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1487</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Temporal Preference Optimization for Long-Form Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13919v1">http://arxiv.org/abs/2501.13919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step</title>
      <itunes:episode>419</itunes:episode>
      <podcast:episode>419</podcast:episode>
      <itunes:title>Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b5a6b91-6fec-48a1-a71a-9220426d63e3</guid>
      <link>https://share.transistor.fm/s/a95526dc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13926v1">http://arxiv.org/abs/2501.13926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13926v1">http://arxiv.org/abs/2501.13926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:41:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a95526dc/8be01329.mp3" length="20281996" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng</p>

            <p><strong>Title:</strong><br>
            Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13926v1">http://arxiv.org/abs/2501.13926v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verify and reinforce image generation. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos</title>
      <itunes:episode>418</itunes:episode>
      <podcast:episode>418</podcast:episode>
      <itunes:title>Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">66da1027-424c-46f5-b568-eafb25a7a413</guid>
      <link>https://share.transistor.fm/s/c973b4fa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13826v1">http://arxiv.org/abs/2501.13826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13826v1">http://arxiv.org/abs/2501.13826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:41:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c973b4fa/924d7b14.mp3" length="20481360" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13826v1">http://arxiv.org/abs/2501.13826v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DiffuEraser: A Diffusion Model for Video Inpainting</title>
      <itunes:episode>417</itunes:episode>
      <podcast:episode>417</podcast:episode>
      <itunes:title>DiffuEraser: A Diffusion Model for Video Inpainting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa98f7c0-18f9-48cd-bfed-2f42f8881b97</guid>
      <link>https://share.transistor.fm/s/a94307f9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo</p>

            <p><strong>Title:</strong><br>
            DiffuEraser: A Diffusion Model for Video Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10018v1">http://arxiv.org/abs/2501.10018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo</p>

            <p><strong>Title:</strong><br>
            DiffuEraser: A Diffusion Model for Video Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10018v1">http://arxiv.org/abs/2501.10018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:41:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a94307f9/b5cf8eb0.mp3" length="21022582" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1310</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo</p>

            <p><strong>Title:</strong><br>
            DiffuEraser: A Diffusion Model for Video Inpainting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10018v1">http://arxiv.org/abs/2501.10018v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models</title>
      <itunes:episode>416</itunes:episode>
      <podcast:episode>416</podcast:episode>
      <itunes:title>IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64b94a2f-ed41-4fda-a0d6-c979d9dfd7db</guid>
      <link>https://share.transistor.fm/s/ed14cee8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13920v1">http://arxiv.org/abs/2501.13920v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13920v1">http://arxiv.org/abs/2501.13920v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:40:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ed14cee8/9306af38.mp3" length="28333147" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1767</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13920v1">http://arxiv.org/abs/2501.13920v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback</title>
      <itunes:episode>415</itunes:episode>
      <podcast:episode>415</podcast:episode>
      <itunes:title>Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c81d2392-6ece-4d56-a3a3-772a9f34181b</guid>
      <link>https://share.transistor.fm/s/73903041</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang</p>

            <p><strong>Title:</strong><br>
            Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10799v1">http://arxiv.org/abs/2501.10799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang</p>

            <p><strong>Title:</strong><br>
            Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10799v1">http://arxiv.org/abs/2501.10799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:40:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73903041/ca07a7dd.mp3" length="20472991" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang</p>

            <p><strong>Title:</strong><br>
            Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10799v1">http://arxiv.org/abs/2501.10799v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt</title>
      <itunes:episode>414</itunes:episode>
      <podcast:episode>414</podcast:episode>
      <itunes:title>One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">136b51a9-f283-4c57-bf4d-99ea5642cb45</guid>
      <link>https://share.transistor.fm/s/875b818d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13554v1">http://arxiv.org/abs/2501.13554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent, identity-preserving generation required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, which we coin context consistency, to comprehend identity through the context of a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13554v1">http://arxiv.org/abs/2501.13554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent, identity-preserving generation required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, which we coin context consistency, to comprehend identity through the context of a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 24 Jan 2025 20:40:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/875b818d/f4b80501.mp3" length="21303908" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng</p>

            <p><strong>Title:</strong><br>
            One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13554v1">http://arxiv.org/abs/2501.13554v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent, identity-preserving generation required for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, which we coin context consistency, to comprehend identity through the context of a single prompt. Drawing inspiration from this inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach 1Prompt1Story concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</title>
      <itunes:episode>413</itunes:episode>
      <podcast:episode>413</podcast:episode>
      <itunes:title>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">702e33c3-8920-4c5d-9a48-c67cfa13fd37</guid>
      <link>https://share.transistor.fm/s/5d8b4fd1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12948v1">http://arxiv.org/abs/2501.12948v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally develops numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12948v1">http://arxiv.org/abs/2501.12948v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally develops numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:37:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5d8b4fd1/874a309d.mp3" length="20244791" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1262</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, Zhen Zhang</p>

            <p><strong>Title:</strong><br>
            DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12948v1">http://arxiv.org/abs/2501.12948v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally develops numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</title>
      <itunes:episode>412</itunes:episode>
      <podcast:episode>412</podcast:episode>
      <itunes:title>VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a5638d6d-ff22-408c-8f73-3d640d70488f</guid>
      <link>https://share.transistor.fm/s/ed9d5ab0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao</p>

            <p><strong>Title:</strong><br>
            VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13106v2">http://arxiv.org/abs/2501.13106v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) Video-centric Fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao</p>

            <p><strong>Title:</strong><br>
            VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13106v2">http://arxiv.org/abs/2501.13106v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) Video-centric Fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:36:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ed9d5ab0/0c8bed0a.mp3" length="22603756" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1409</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao</p>

            <p><strong>Title:</strong><br>
            VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13106v2">http://arxiv.org/abs/2501.13106v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) Video-centric Fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a corresponding number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces</title>
      <itunes:episode>411</itunes:episode>
      <podcast:episode>411</podcast:episode>
      <itunes:title>FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">97b47ab1-49cf-454c-b898-4ed0432a3533</guid>
      <link>https://share.transistor.fm/s/87139787</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12909v1">http://arxiv.org/abs/2501.12909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12909v1">http://arxiv.org/abs/2501.12909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:36:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/87139787/a1f584dc.mp3" length="23914059" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1491</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA</p>

            <p><strong>Authors:</strong><br>
            Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang</p>

            <p><strong>Title:</strong><br>
            FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12909v1">http://arxiv.org/abs/2501.12909v1</a></p>

            <p><strong>Abstract:</strong><br>
            Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback</title>
      <itunes:episode>410</itunes:episode>
      <podcast:episode>410</podcast:episode>
      <itunes:title>Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b0c5f90-71c9-4c40-b2a4-292b680aa418</guid>
      <link>https://share.transistor.fm/s/25912427</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12895v1">http://arxiv.org/abs/2501.12895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12895v1">http://arxiv.org/abs/2501.12895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.</p>
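
            <p><strong>Code sketch:</strong><br>
            A schematic Python sketch of a test-time refinement loop in the spirit of TPO: sample candidates, score them with a reward model, turn the numeric comparison into a textual critique, and revise around the current best. The callables generate, score, critique, and revise (and their signatures) are hypothetical placeholders for model calls, not the paper's API.</p>

            <pre><code>def tpo_align(prompt, generate, score, critique, revise, width=4, depth=3):
    """Iteratively refine a response at inference time using textual feedback.

    generate(prompt, n)               -> list of candidate responses
    score(prompt, resp)               -> scalar reward-model score
    critique(prompt, best, worst)     -> textual feedback contrasting the two
    revise(prompt, resp, feedback, n) -> list of revised responses
    """
    candidates = generate(prompt, width)
    for _ in range(depth):
        ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
        best, worst = ranked[0], ranked[-1]
        feedback = critique(prompt, best, worst)             # numeric signal -> text
        candidates = revise(prompt, best, feedback, width)   # refine around the best
    return max(candidates, key=lambda r: score(prompt, r))

# toy usage with stub callables standing in for LLM / reward-model calls
best = tpo_align(
    "Explain TPO in one sentence.",
    generate=lambda p, n: [f"draft {i}" for i in range(n)],
    score=lambda p, r: len(r),
    critique=lambda p, b, w: "prefer the more detailed draft",
    revise=lambda p, r, fb, n: [f"{r} (revision {i}: {fb})" for i in range(n)],
)
print(best)
</code></pre>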
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:36:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/25912427/37a3a208.mp3" length="21835966" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1361</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 42 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng</p>

            <p><strong>Title:</strong><br>
            Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12895v1">http://arxiv.org/abs/2501.12895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Kimi k1.5: Scaling Reinforcement Learning with LLMs</title>
      <itunes:episode>409</itunes:episode>
      <podcast:episode>409</podcast:episode>
      <itunes:title>Kimi k1.5: Scaling Reinforcement Learning with LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">677532f0-802b-49f4-8696-5ba94e8cb574</guid>
      <link>https://share.transistor.fm/s/6b7138cc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang</p>

            <p><strong>Title:</strong><br>
            Kimi k1.5: Scaling Reinforcement Learning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12599v1">http://arxiv.org/abs/2501.12599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang</p>

            <p><strong>Title:</strong><br>
            Kimi k1.5: Scaling Reinforcement Learning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12599v1">http://arxiv.org/abs/2501.12599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:35:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b7138cc/f0406fb0.mp3" length="17825194" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1110</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Mengnan Dong, Neo Zhang, Ningchen Ma, Qiwei Pan, Qucheng Gong, Shaowei Liu, Shengling Ma, Shupeng Wei, Sihan Cao, Siying Huang, Tao Jiang, Weihao Gao, Weimin Xiong, Weiran He, Weixiao Huang, Wenhao Wu, Wenyang He, Xianghui Wei, Xianqing Jia, Xingzhe Wu, Xinran Xu, Xinxing Zu, Xinyu Zhou, Xuehai Pan, Y. Charles, Yang Li, Yangyang Hu, Yangyang Liu, Yanru Chen, Yejie Wang, Yibo Liu, Yidao Qin, Yifeng Liu, Ying Yang, Yiping Bao, Yulun Du, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zhaoji Wang, Zhaowei Li, Zhen Zhu, Zheng Zhang, Zhexu Wang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Ziyao Xu, Zonghan Yang</p>

            <p><strong>Title:</strong><br>
            Kimi k1.5: Scaling Reinforcement Learning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12599v1">http://arxiv.org/abs/2501.12599v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simple yet effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Autonomy-of-Experts Models</title>
      <itunes:episode>408</itunes:episode>
      <podcast:episode>408</podcast:episode>
      <itunes:title>Autonomy-of-Experts Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed2bbb3f-f006-4aaa-aa81-ba5ac0029f98</guid>
      <link>https://share.transistor.fm/s/764315fb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            Autonomy-of-Experts Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13074v1">http://arxiv.org/abs/2501.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only a subset of the parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            Autonomy-of-Experts Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13074v1">http://arxiv.org/abs/2501.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only a subset of the parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.</p>
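
            <p><strong>Code sketch:</strong><br>
            A toy NumPy sketch of expert self-selection: every expert computes a cheap low-rank partial activation, experts are ranked by the norm of that activation, and only the top-k finish their forward pass. The dimensions, the ReLU MLP shape, and the simple averaging of expert outputs are illustrative assumptions, not the paper's architecture.</p>

            <pre><code>import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank, n_experts, top_k = 64, 256, 16, 8, 2

# Each expert's up-projection W1 (d_model x d_ff) is factorized as A @ B.
# Only the cheap A-part is evaluated for every expert; its norm doubles
# as the routing signal, so no separate router is needed.
experts = []
for _ in range(n_experts):
    A = rng.normal(size=(d_model, rank)) / np.sqrt(d_model)
    B = rng.normal(size=(rank, d_ff)) / np.sqrt(rank)
    W2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
    experts.append((A, B, W2))

def aoe_forward(x):
    """x: (d_model,) token. Experts self-select via partial-activation norms."""
    low_rank = [x @ A for (A, _, _) in experts]               # cheap partial activations
    norms = np.array([np.linalg.norm(h) for h in low_rank])   # self-reported capacity
    chosen = np.argsort(norms)[-top_k:]                       # only top-k proceed
    out = np.zeros(d_model)
    for idx in chosen:
        _, B, W2 = experts[idx]
        h = np.maximum(low_rank[idx] @ B, 0.0)                # finish up-projection + ReLU
        out += h @ W2
    return out / top_k

token = rng.normal(size=d_model)
print(aoe_forward(token).shape)  # (64,)
</code></pre>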
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:35:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/764315fb/d9573b05.mp3" length="19370364" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan</p>

            <p><strong>Title:</strong><br>
            Autonomy-of-Experts Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13074v1">http://arxiv.org/abs/2501.13074v1</a></p>

            <p><strong>Abstract:</strong><br>
            Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only a subset of the parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning</title>
      <itunes:episode>407</itunes:episode>
      <podcast:episode>407</podcast:episode>
      <itunes:title>O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">15635267-db88-4bec-99a2-804d9d2ed28a</guid>
      <link>https://share.transistor.fm/s/5221cef5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12570v1">http://arxiv.org/abs/2501.12570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, this long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), which aims to minimize reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12570v1">http://arxiv.org/abs/2501.12570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, this long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), which aims to minimize reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner</p>
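
            <p><strong>Code sketch:</strong><br>
            One plausible shape of a length-harmonizing reward for RL-style fine-tuning: reward reasoning that is shorter than the pre-sampled baseline while penalizing any drop below the baseline accuracy. The functional form and the coefficient lam are illustrative assumptions, not the paper's exact objective.</p>

            <pre><code>def length_harmonizing_reward(correct, length, baseline_acc, baseline_len, lam=2.0):
    """Schematic reward encouraging shorter reasoning under an accuracy constraint.

    correct      : whether the sampled solution is correct (bool)
    length       : token length of the sampled reasoning
    baseline_acc : accuracy estimated by pre-sampling the reference model
    baseline_len : mean reasoning length of those pre-samples
    lam          : weight of the accuracy constraint (assumed value)
    """
    length_gain = baseline_len / max(length, 1) - 1.0   # > 0 when shorter than baseline
    acc_gap = float(correct) - baseline_acc             # < 0 when below baseline accuracy
    return length_gain + lam * acc_gap

# toy check: a shorter, still-correct answer earns the higher reward
print(length_harmonizing_reward(True, 400, baseline_acc=0.7, baseline_len=800))
print(length_harmonizing_reward(False, 300, baseline_acc=0.7, baseline_len=800))
</code></pre>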
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:35:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5221cef5/60965b25.mp3" length="21679634" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao</p>

            <p><strong>Title:</strong><br>
            O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12570v1">http://arxiv.org/abs/2501.12570v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, long-thought reasoning LLMs, such as OpenAI's O1, have adopted extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, this long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), which aims to minimize reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament</title>
      <itunes:episode>406</itunes:episode>
      <podcast:episode>406</podcast:episode>
      <itunes:title>Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8e23402d-4324-4e5b-a94c-d2f29644316c</guid>
      <link>https://share.transistor.fm/s/095f670c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13007v1">http://arxiv.org/abs/2501.13007v1</a></p>

            <p><strong>Abstract:</strong><br>
            Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given a math problem, Pairwise RM evaluates the correctness of two candidate solutions simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and eliminates the incorrect ones iteratively. We construct a large-scale dataset of 443K pairwise comparisons derived from NumiaMath, annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models. A relative improvement of 40% to 60% is achieved on the top 50% most challenging problems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13007v1">http://arxiv.org/abs/2501.13007v1</a></p>

            <p><strong>Abstract:</strong><br>
            Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given a math problem, Pairwise RM evaluates the correctness of two candidate solutions simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and eliminates the incorrect ones iteratively. We construct a large-scale dataset of 443K pairwise comparisons derived from NumiaMath, annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models. A relative improvement of 40% to 60% is achieved on the top 50% most challenging problems.</p>
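
            <p><strong>Code sketch:</strong><br>
            A minimal Python sketch of best-of-N selection via a knockout tournament: candidates are paired, a pairwise judge (a stand-in for the Pairwise RM) picks each winner, and losers are eliminated until one solution remains. The bracket shuffling and bye handling are illustrative assumptions.</p>

            <pre><code>import random

def knockout_best_of_n(problem, candidates, pairwise_judge, seed=0):
    """Select one solution from `candidates` via a knockout tournament.

    pairwise_judge(problem, a, b) should return the solution judged more
    likely to be correct; with an odd number of survivors, the unpaired
    candidate advances on a bye.
    """
    rng = random.Random(seed)
    survivors = list(candidates)
    while len(survivors) > 1:
        rng.shuffle(survivors)                 # fresh random bracket each round
        next_round = []
        for i in range(0, len(survivors) - 1, 2):
            next_round.append(pairwise_judge(problem, survivors[i], survivors[i + 1]))
        if len(survivors) % 2 == 1:            # bye for the odd one out
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0]

# toy usage with a stub judge that prefers the longer derivation
cands = [f"solution {i}: " + "step " * i for i in range(1, 9)]
winner = knockout_best_of_n("1 + 1 = ?", cands,
                            pairwise_judge=lambda p, a, b: a if len(a) >= len(b) else b)
print(winner[:30])
</code></pre>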
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:34:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/095f670c/e35c23fe.mp3" length="21293433" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13007v1">http://arxiv.org/abs/2501.13007v1</a></p>

            <p><strong>Abstract:</strong><br>
            Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given a math problem, Pairwise RM evaluates the correctness of two candidate solutions simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and eliminates the incorrect ones iteratively. We construct a large-scale dataset of 443K pairwise comparisons derived from NumiaMath, annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models. A relative improvement of 40% to 60% is achieved on the top 50% most challenging problems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems</title>
      <itunes:episode>405</itunes:episode>
      <podcast:episode>405</podcast:episode>
      <itunes:title>IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43146722-184b-4ef1-96ba-33c09174da39</guid>
      <link>https://share.transistor.fm/s/e0ae0546</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Elad Levi, Ilan Kadar</p>

            <p><strong>Title:</strong><br>
            IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11067v1">http://arxiv.org/abs/2501.11067v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Elad Levi, Ilan Kadar</p>

            <p><strong>Title:</strong><br>
            IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11067v1">http://arxiv.org/abs/2501.11067v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:34:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e0ae0546/90906cd8.mp3" length="23796603" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1484</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Elad Levi, Ilan Kadar</p>

            <p><strong>Title:</strong><br>
            IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11067v1">http://arxiv.org/abs/2501.11067v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass</title>
      <itunes:episode>404</itunes:episode>
      <podcast:episode>404</podcast:episode>
      <itunes:title>Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">068d7fb8-be28-4db5-afaf-c43a491851d4</guid>
      <link>https://share.transistor.fm/s/81b7b82e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli</p>

            <p><strong>Title:</strong><br>
            Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13928v1">http://arxiv.org/abs/2501.13928v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli</p>

            <p><strong>Title:</strong><br>
            Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13928v1">http://arxiv.org/abs/2501.13928v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 23 Jan 2025 20:33:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81b7b82e/a30a15ec.mp3" length="20777258" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1295</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI, cs.GR, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli</p>

            <p><strong>Title:</strong><br>
            Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.13928v1">http://arxiv.org/abs/2501.13928v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives. Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views. In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. Fast3R's Transformer-based architecture forwards N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation. These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training</title>
      <itunes:episode>403</itunes:episode>
      <podcast:episode>403</podcast:episode>
      <itunes:title>Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3a27ff33-a396-42c0-a68c-7a3af92b3751</guid>
      <link>https://share.transistor.fm/s/5bd44ba5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen</p>

            <p><strong>Title:</strong><br>
            Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11425v1">http://arxiv.org/abs/2501.11425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recovers correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from this step, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen</p>

            <p><strong>Title:</strong><br>
            Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11425v1">http://arxiv.org/abs/2501.11425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recovers correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from this step, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:30:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5bd44ba5/3d6de593.mp3" length="19979383" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1245</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen</p>

            <p><strong>Title:</strong><br>
            Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11425v1">http://arxiv.org/abs/2501.11425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recovers correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from this step, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMVU: Measuring Expert-Level Multi-Discipline Video Understanding</title>
      <itunes:episode>402</itunes:episode>
      <podcast:episode>402</podcast:episode>
      <itunes:title>MMVU: Measuring Expert-Level Multi-Discipline Video Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bdc88aa4-94b5-418c-a027-e263ee72bf53</guid>
      <link>https://share.transistor.fm/s/9d216e6f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            MMVU: Measuring Expert-Level Multi-Discipline Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12380v1">http://arxiv.org/abs/2501.12380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities &amp; Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            MMVU: Measuring Expert-Level Multi-Discipline Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12380v1">http://arxiv.org/abs/2501.12380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities &amp; Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:30:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9d216e6f/f1c4644e.mp3" length="24299396" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1515</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 59 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            MMVU: Measuring Expert-Level Multi-Discipline Video Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12380v1">http://arxiv.org/abs/2501.12380v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities &amp; Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</title>
      <itunes:episode>401</itunes:episode>
      <podcast:episode>401</podcast:episode>
      <itunes:title>Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ec406a4-e777-45d6-b451-34cf4ea8e8f1</guid>
      <link>https://share.transistor.fm/s/458772cc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11873v1">http://arxiv.org/abs/2501.11873v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoE models is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.</p>
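
            <p>The loss above can be made concrete with a minimal, hedged sketch (not the paper's code): it computes LBL = N_E * sum_i f_i * p_i from router outputs and, when a distributed process group is active, all-reduces f_i across micro-batches to approximate the global-batch variant described in the abstract. The function name, tensor shapes, and use of torch.distributed are illustrative assumptions.</p>

            <pre><code>import torch
import torch.distributed as dist

def load_balancing_loss(gate_probs, topk_idx, num_experts, global_batch=False):
    # gate_probs: [num_tokens, num_experts] softmax router scores (assumed shape)
    # topk_idx:   [num_tokens, k] indices of the experts selected per token
    # p_i: average gating score of expert i over the local micro-batch
    p = gate_probs.mean(dim=0)
    # f_i: fraction of routed (token, slot) pairs assigned to expert i
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    f = counts / counts.sum()
    if global_batch and dist.is_initialized():
        # extra communication step: average f_i across all micro-batches
        dist.all_reduce(f, op=dist.ReduceOp.SUM)
        f = f / dist.get_world_size()
    return num_experts * torch.sum(f * p)
</code></pre>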
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11873v1">http://arxiv.org/abs/2501.11873v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoE models is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:29:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/458772cc/9564eded.mp3" length="22772633" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11873v1">http://arxiv.org/abs/2501.11873v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoE models is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains far more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space</title>
      <itunes:episode>400</itunes:episode>
      <podcast:episode>400</podcast:episode>
      <itunes:title>TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f496e59-0126-4264-a101-7ec7822fa193</guid>
      <link>https://share.transistor.fm/s/885ddadb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel</p>

            <p><strong>Title:</strong><br>
            TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12224v1">http://arxiv.org/abs/2501.12224v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel</p>

            <p><strong>Title:</strong><br>
            TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12224v1">http://arxiv.org/abs/2501.12224v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:29:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/885ddadb/0f33223a.mp3" length="25011193" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1560</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel</p>

            <p><strong>Title:</strong><br>
            TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12224v1">http://arxiv.org/abs/2501.12224v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UI-TARS: Pioneering Automated GUI Interaction with Native Agents</title>
      <itunes:episode>399</itunes:episode>
      <podcast:episode>399</podcast:episode>
      <itunes:title>UI-TARS: Pioneering Automated GUI Interaction with Native Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">603b515e-ea69-4121-86fb-6263f1a74ea7</guid>
      <link>https://share.transistor.fm/s/aa6a6b63</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS: Pioneering Automated GUI Interaction with Native Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12326v1">http://arxiv.org/abs/2501.12326v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS: Pioneering Automated GUI Interaction with Native Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12326v1">http://arxiv.org/abs/2501.12326v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:29:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa6a6b63/2360310e.mp3" length="20176646" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi</p>

            <p><strong>Title:</strong><br>
            UI-TARS: Pioneering Automated GUI Interaction with Native Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12326v1">http://arxiv.org/abs/2501.12326v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model</title>
      <itunes:episode>398</itunes:episode>
      <podcast:episode>398</podcast:episode>
      <itunes:title>InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">36f472e6-21fd-4539-a60e-3e4cba6195dd</guid>
      <link>https://share.transistor.fm/s/fa752484</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12368v1">http://arxiv.org/abs/2501.12368v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12368v1">http://arxiv.org/abs/2501.12368v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:28:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa752484/b352cea2.mp3" length="20630980" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1286</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12368v1">http://arxiv.org/abs/2501.12368v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training: integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks</title>
      <itunes:episode>397</itunes:episode>
      <podcast:episode>397</podcast:episode>
      <itunes:title>Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5e365c95-4e2e-4272-b324-cfbffeb9a1bb</guid>
      <link>https://share.transistor.fm/s/d093d547</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11733v1">http://arxiv.org/abs/2501.11733v1</a></p>

            <p><strong>Abstract:</strong><br>
            Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11733v1">http://arxiv.org/abs/2501.11733v1</a></p>

            <p><strong>Abstract:</strong><br>
            Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:28:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d093d547/399519b6.mp3" length="22494229" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1402</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji</p>

            <p><strong>Title:</strong><br>
            Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11733v1">http://arxiv.org/abs/2501.11733v1</a></p>

            <p><strong>Abstract:</strong><br>
            Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reasoning Language Models: A Blueprint</title>
      <itunes:episode>396</itunes:episode>
      <podcast:episode>396</podcast:episode>
      <itunes:title>Reasoning Language Models: A Blueprint</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">50f7f729-3dbc-4c3c-8a72-23f208d90f17</guid>
      <link>https://share.transistor.fm/s/6a535bf4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler</p>

            <p><strong>Title:</strong><br>
            Reasoning Language Models: A Blueprint</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11223v2">http://arxiv.org/abs/2501.11223v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler</p>

            <p><strong>Title:</strong><br>
            Reasoning Language Models: A Blueprint</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11223v2">http://arxiv.org/abs/2501.11223v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:27:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a535bf4/4ac15554.mp3" length="20910556" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1303</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, Torsten Hoefler</p>

            <p><strong>Title:</strong><br>
            Reasoning Language Models: A Blueprint</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.11223v2">http://arxiv.org/abs/2501.11223v2</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM development and experimentation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation</title>
      <itunes:episode>395</itunes:episode>
      <podcast:episode>395</podcast:episode>
      <itunes:title>Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3cf4ed2-a7f6-44d6-aa94-2626246258cf</guid>
      <link>https://share.transistor.fm/s/83c14967</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12202v2">http://arxiv.org/abs/2501.12202v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, and texture quality. Hunyuan3D 2.0 is publicly released to fill the gap in large-scale foundation generative models for the open-source 3D community. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12202v2">http://arxiv.org/abs/2501.12202v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, and texture quality. Hunyuan3D 2.0 is publicly released to fill the gap in large-scale foundation generative models for the open-source 3D community. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:27:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83c14967/38404da0.mp3" length="19989843" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1246</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo</p>

            <p><strong>Title:</strong><br>
            Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.12202v2">http://arxiv.org/abs/2501.12202v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, and texture quality. Hunyuan3D 2.0 is publicly released to fill the gap in large-scale foundation generative models for the open-source 3D community. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments</title>
      <itunes:episode>394</itunes:episode>
      <podcast:episode>394</podcast:episode>
      <itunes:title>Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9bd32cc2-336f-4478-9c5e-7a47d20119aa</guid>
      <link>https://share.transistor.fm/s/7c13b867</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık</p>

            <p><strong>Title:</strong><br>
            Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10893v1">http://arxiv.org/abs/2501.10893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık</p>

            <p><strong>Title:</strong><br>
            Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10893v1">http://arxiv.org/abs/2501.10893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 22 Jan 2025 20:27:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7c13b867/08a04e1c.mp3" length="18752271" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1168</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, Sercan Ö. Arık</p>

            <p><strong>Title:</strong><br>
            Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.10893v1">http://arxiv.org/abs/2501.10893v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose Learn-by-interact, a data-centric framework to adapt LLM agents to any given environment without human annotations. Learn-by-interact synthesizes trajectories of agent-environment interactions based on documentation, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact in various downstream agentic tasks -- baseline results are improved by up to 12.2% for ICL with Claude-3.5 and 19.5% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 14.0% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that Learn-by-interact will serve as a foundation for agent data synthesis as LLMs are increasingly deployed in real-world environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GameFactory: Creating New Games with Generative Interactive Videos</title>
      <itunes:episode>393</itunes:episode>
      <podcast:episode>393</podcast:episode>
      <itunes:title>GameFactory: Creating New Games with Generative Interactive Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">530f4498-c414-422a-baa0-20fc8a43c45a</guid>
      <link>https://share.transistor.fm/s/8dfedd6a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GameFactory: Creating New Games with Generative Interactive Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.08325v1">http://arxiv.org/abs/2501.08325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GameFactory: Creating New Games with Generative Interactive Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.08325v1">http://arxiv.org/abs/2501.08325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Jan 2025 19:07:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8dfedd6a/bbd23013.mp3" length="21970946" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1369</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 48 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GameFactory: Creating New Games with Generative Interactive Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.08325v1">http://arxiv.org/abs/2501.08325v1</a></p>

            <p><strong>Abstract:</strong><br>
            Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and the small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diverse action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at https://vvictoryuki.github.io/gamefactory/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoWorld: Exploring Knowledge Learning from Unlabeled Videos</title>
      <itunes:episode>392</itunes:episode>
      <podcast:episode>392</podcast:episode>
      <itunes:title>VideoWorld: Exploring Knowledge Learning from Unlabeled Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">711ca948-f176-435d-8778-979e8cc472b6</guid>
      <link>https://share.transistor.fm/s/c7a0654d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin</p>

            <p><strong>Title:</strong><br>
            VideoWorld: Exploring Knowledge Learning from Unlabeled Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09781v1">http://arxiv.org/abs/2501.09781v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin</p>

            <p><strong>Title:</strong><br>
            VideoWorld: Exploring Knowledge Learning from Unlabeled Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09781v1">http://arxiv.org/abs/2501.09781v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Jan 2025 19:07:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c7a0654d/12ebe23b.mp3" length="18706681" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1165</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin</p>

            <p><strong>Title:</strong><br>
            VideoWorld: Exploring Knowledge Learning from Unlabeled Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09781v1">http://arxiv.org/abs/2501.09781v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SEAL: Entangled White-box Watermarks on Low-Rank Adaptation</title>
      <itunes:episode>391</itunes:episode>
      <podcast:episode>391</podcast:episode>
      <itunes:title>SEAL: Entangled White-box Watermarks on Low-Rank Adaptation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">92f0d7be-4afa-4849-aa53-7755a84e14af</guid>
      <link>https://share.transistor.fm/s/f7679a33</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            SEAL: Entangled White-box Watermarks on Low-Rank Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09284v2">http://arxiv.org/abs/2501.09284v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especially through watermark-based techniques, remains underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on LoRA weights), a universal white-box watermarking method for LoRA. SEAL embeds a secret, non-trainable matrix between the trainable LoRA weights, serving as a passport to claim ownership. SEAL then entangles the passport with the LoRA weights through training, without extra loss for entanglement, and distributes the finetuned weights after hiding the passport. When applying SEAL, we observed no performance degradation across commonsense reasoning, textual/visual instruction tuning, and text-to-image synthesis tasks. We demonstrate that SEAL is robust against a variety of known attacks: removal, obfuscation, and ambiguity attacks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            SEAL: Entangled White-box Watermarks on Low-Rank Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09284v2">http://arxiv.org/abs/2501.09284v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especially through watermark-based techniques, remains underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on LoRA weights), a universal white-box watermarking method for LoRA. SEAL embeds a secret, non-trainable matrix between the trainable LoRA weights, serving as a passport to claim ownership. SEAL then entangles the passport with the LoRA weights through training, without extra loss for entanglement, and distributes the finetuned weights after hiding the passport. When applying SEAL, we observed no performance degradation across commonsense reasoning, textual/visual instruction tuning, and text-to-image synthesis tasks. We demonstrate that SEAL is robust against a variety of known attacks: removal, obfuscation, and ambiguity attacks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 21 Jan 2025 19:06:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f7679a33/dd595892.mp3" length="21431772" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1336</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.AI, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu</p>

            <p><strong>Title:</strong><br>
            SEAL: Entangled White-box Watermarks on Low-Rank Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.09284v2">http://arxiv.org/abs/2501.09284v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especially through watermark-based techniques, remains underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on LoRA weights), a universal white-box watermarking method for LoRA. SEAL embeds a secret, non-trainable matrix between the trainable LoRA weights, serving as a passport to claim ownership. SEAL then entangles the passport with the LoRA weights through training, without extra loss for entanglement, and distributes the finetuned weights after hiding the passport. When applying SEAL, we observed no performance degradation across commonsense reasoning, textual/visual instruction tuning, and text-to-image synthesis tasks. We demonstrate that SEAL is robust against a variety of known attacks: removal, obfuscation, and ambiguity attacks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Lessons of Developing Process Reward Models in Mathematical Reasoning</title>
      <itunes:episode>390</itunes:episode>
      <podcast:episode>390</podcast:episode>
      <itunes:title>The Lessons of Developing Process Reward Models in Mathematical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">991f5c17-e0cd-4ed0-9ab5-5d3159a4f8f4</guid>
      <link>https://share.transistor.fm/s/19d865d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            The Lessons of Developing Process Reward Models in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07301v1">http://arxiv.org/abs/2501.07301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            The Lessons of Developing Process Reward Models in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07301v1">http://arxiv.org/abs/2501.07301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:41:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/19d865d6/1139b8d4.mp3" length="18192602" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1133</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            The Lessons of Developing Process Reward Models in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07301v1">http://arxiv.org/abs/2501.07301v1</a></p>

            <p><strong>Abstract:</strong><br>
            Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Tensor Product Attention Is All You Need</title>
      <itunes:episode>389</itunes:episode>
      <podcast:episode>389</podcast:episode>
      <itunes:title>Tensor Product Attention Is All You Need</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b2b58706-3e3c-4019-8bbf-87d3fb0607ce</guid>
      <link>https://share.transistor.fm/s/9bfe2d86</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao</p>

            <p><strong>Title:</strong><br>
            Tensor Product Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06425v1">http://arxiv.org/abs/2501.06425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao</p>

            <p><strong>Title:</strong><br>
            Tensor Product Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06425v1">http://arxiv.org/abs/2501.06425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:41:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bfe2d86/cb311eec.mp3" length="20172442" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1257</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao</p>

            <p><strong>Title:</strong><br>
            Tensor Product Attention Is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06425v1">http://arxiv.org/abs/2501.06425v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>$\text{Transformer}^2$: Self-adaptive LLMs</title>
      <itunes:episode>388</itunes:episode>
      <podcast:episode>388</podcast:episode>
      <itunes:title>$\text{Transformer}^2$: Self-adaptive LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">29757604-ab69-4358-84f1-89aefbc9b0d0</guid>
      <link>https://share.transistor.fm/s/f22ecc77</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Sun, Edoardo Cetin, Yujin Tang</p>

            <p><strong>Title:</strong><br>
            $\text{Transformer}^2$: Self-adaptive LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06252v2">http://arxiv.org/abs/2501.06252v2</a></p>

            <p><strong>Abstract:</strong><br>
            Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, $\text{Transformer}^2$ employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. $\text{Transformer}^2$ demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. $\text{Transformer}^2$ represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.</p>
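
            <p><strong>Sketch:</strong><br>
            A schematic of the singular-component adaptation the abstract describes: a weight matrix is decomposed with an SVD and only its singular values are rescaled by a task-specific expert vector. The RL-trained expert vectors and the two-pass dispatch system are omitted; this is not the authors' implementation.</p>

            <pre><code>import torch

def adapt_weight(weight: torch.Tensor, expert_z: torch.Tensor) -> torch.Tensor:
    """Rescale only the singular values of `weight` by the vector `expert_z`."""
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    return u @ torch.diag(s * expert_z) @ vh

w = torch.randn(256, 256)
z = torch.ones(256) + 0.05 * torch.randn(256)   # hypothetical expert vector
w_task = adapt_weight(w, z)                     # task-adapted weight</code></pre>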
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Sun, Edoardo Cetin, Yujin Tang</p>

            <p><strong>Title:</strong><br>
            $\text{Transformer}^2$: Self-adaptive LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06252v2">http://arxiv.org/abs/2501.06252v2</a></p>

            <p><strong>Abstract:</strong><br>
            Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, $\text{Transformer}^2$ employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. $\text{Transformer}^2$ demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. $\text{Transformer}^2$ represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:41:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f22ecc77/3f3c1fe2.mp3" length="25651471" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1600</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qi Sun, Edoardo Cetin, Yujin Tang</p>

            <p><strong>Title:</strong><br>
            $\text{Transformer}^2$: Self-adaptive LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06252v2">http://arxiv.org/abs/2501.06252v2</a></p>

            <p><strong>Abstract:</strong><br>
            Self-adaptive large language models (LLMs) aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce $\text{Transformer}^2$, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, $\text{Transformer}^2$ employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. $\text{Transformer}^2$ demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. $\text{Transformer}^2$ represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MinMo: A Multimodal Large Language Model for Seamless Voice Interaction</title>
      <itunes:episode>387</itunes:episode>
      <podcast:episode>387</podcast:episode>
      <itunes:title>MinMo: A Multimodal Large Language Model for Seamless Voice Interaction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f96ddaa-a05d-458d-96e6-a689ef0d3d5f</guid>
      <link>https://share.transistor.fm/s/7930d7fc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou</p>

            <p><strong>Title:</strong><br>
            MinMo: A Multimodal Large Language Model for Seamless Voice Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06282v1">http://arxiv.org/abs/2501.06282v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou</p>

            <p><strong>Title:</strong><br>
            MinMo: A Multimodal Large Language Model for Seamless Voice Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06282v1">http://arxiv.org/abs/2501.06282v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:40:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7930d7fc/6aacaadc.mp3" length="22656405" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou</p>

            <p><strong>Title:</strong><br>
            MinMo: A Multimodal Large Language Model for Seamless Voice Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06282v1">http://arxiv.org/abs/2501.06282v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoAuteur: Towards Long Narrative Video Generation</title>
      <itunes:episode>386</itunes:episode>
      <podcast:episode>386</podcast:episode>
      <itunes:title>VideoAuteur: Towards Long Narrative Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">83eb9481-e343-4f22-8005-385197f57aaa</guid>
      <link>https://share.transistor.fm/s/b9291e7a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            VideoAuteur: Towards Long Narrative Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06173v1">http://arxiv.org/abs/2501.06173v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            VideoAuteur: Towards Long Narrative Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06173v1">http://arxiv.org/abs/2501.06173v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:40:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9291e7a/b2b3e977.mp3" length="20913914" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1303</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junfei Xiao, Feng Cheng, Lu Qi, Liangke Gui, Jiepeng Cen, Zhibei Ma, Alan Yuille, Lu Jiang</p>

            <p><strong>Title:</strong><br>
            VideoAuteur: Towards Long Narrative Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06173v1">http://arxiv.org/abs/2501.06173v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning</title>
      <itunes:episode>385</itunes:episode>
      <podcast:episode>385</podcast:episode>
      <itunes:title>O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6574ff51-7d43-4ce1-8dbf-6622339504d7</guid>
      <link>https://share.transistor.fm/s/de7f4910</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06458v1">http://arxiv.org/abs/2501.06458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06458v1">http://arxiv.org/abs/2501.06458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:40:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de7f4910/1a68febd.mp3" length="23992209" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1496</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06458v1">http://arxiv.org/abs/2501.06458v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebWalker: Benchmarking LLMs in Web Traversal</title>
      <itunes:episode>384</itunes:episode>
      <podcast:episode>384</podcast:episode>
      <itunes:title>WebWalker: Benchmarking LLMs in Web Traversal</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">343fe8f1-7496-4823-98cb-9b1be4a4f481</guid>
      <link>https://share.transistor.fm/s/a115bed4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang</p>

            <p><strong>Title:</strong><br>
            WebWalker: Benchmarking LLMs in Web Traversal</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07572v2">http://arxiv.org/abs/2501.07572v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.</p>
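
            <p><strong>Sketch:</strong><br>
            A highly simplified sketch of an explore-critic traversal loop in the spirit of the framework described above: the explorer picks which subpage to visit next while the critic accumulates evidence and decides when the query can be answered. The explorer, critic, and fetch_page objects are hypothetical placeholders, not the paper's agents or APIs.</p>

            <pre><code>def web_walk(query, start_url, explorer, critic, fetch_page, max_steps=10):
    memory, url = [], start_url
    for _ in range(max_steps):
        page = fetch_page(url)                        # raw subpage content
        memory.append(critic.extract(query, page))    # keep query-relevant evidence
        answer = critic.try_answer(query, memory)
        if answer is not None:                        # critic: enough evidence gathered
            return answer
        url = explorer.choose_next_link(query, page)  # explorer: which subpage next
        if url is None:
            break
    return critic.try_answer(query, memory)</code></pre>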
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang</p>

            <p><strong>Title:</strong><br>
            WebWalker: Benchmarking LLMs in Web Traversal</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07572v2">http://arxiv.org/abs/2501.07572v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:39:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a115bed4/ad4b09a2.mp3" length="23149152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1443</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang</p>

            <p><strong>Title:</strong><br>
            WebWalker: Benchmarking LLMs in Web Traversal</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07572v2">http://arxiv.org/abs/2501.07572v2</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training</title>
      <itunes:episode>383</itunes:episode>
      <podcast:episode>383</podcast:episode>
      <itunes:title>SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f667f8b2-953b-4b65-842b-16d46a5c28de</guid>
      <link>https://share.transistor.fm/s/c16eeb6b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu</p>

            <p><strong>Title:</strong><br>
            SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06842v1">http://arxiv.org/abs/2501.06842v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git</p>
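
            <p><strong>Sketch:</strong><br>
            A rough sketch of the two ideas the abstract names, spike-aware gradient clipping and momentum reset, attached to a plain Adam-style update. The threshold and reset schedule are invented for illustration; the real optimizer is in the authors' repository linked above.</p>

            <pre><code>import torch

def spam_like_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999),
                   eps=1e-8, spike_factor=50.0, reset_every=500):
    state.setdefault("m", torch.zeros_like(p))
    state.setdefault("v", torch.zeros_like(p))
    state.setdefault("t", 0)
    state["t"] += 1

    if state["t"] > 1:
        # Spike-aware clipping: shrink entries that dwarf the running RMS estimate.
        rms = (state["v"] / (1 - betas[1] ** (state["t"] - 1))).sqrt() + eps
        spike = grad.abs() > spike_factor * rms
        grad = torch.where(spike, grad.sign() * spike_factor * rms, grad)

    if state["t"] % reset_every == 0:
        # Momentum reset so a past spike cannot keep steering later updates.
        state["m"].zero_()
        state["v"].zero_()

    state["m"].mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    state["v"].mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    p.data.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)</code></pre>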
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu</p>

            <p><strong>Title:</strong><br>
            SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06842v1">http://arxiv.org/abs/2501.06842v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:39:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c16eeb6b/daf40bad.mp3" length="19943009" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu</p>

            <p><strong>Title:</strong><br>
            SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.06842v1">http://arxiv.org/abs/2501.06842v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UnCommon Objects in 3D</title>
      <itunes:episode>382</itunes:episode>
      <podcast:episode>382</podcast:episode>
      <itunes:title>UnCommon Objects in 3D</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">416f0292-fe76-4335-8f7a-43c9f67d2813</guid>
      <link>https://share.transistor.fm/s/b21ab46a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny</p>

            <p><strong>Title:</strong><br>
            UnCommon Objects in 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07574v1">http://arxiv.org/abs/2501.07574v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360$^{\circ}$ coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny</p>

            <p><strong>Title:</strong><br>
            UnCommon Objects in 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07574v1">http://arxiv.org/abs/2501.07574v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360$^{\circ}$ coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 14 Jan 2025 22:39:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b21ab46a/4995e8a2.mp3" length="21024643" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1310</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y. Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, David Novotny</p>

            <p><strong>Title:</strong><br>
            UnCommon Objects in 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.07574v1">http://arxiv.org/abs/2501.07574v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI. uCO3D is the largest publicly-available collection of high-resolution videos of objects with 3D annotations that ensures full-360$^{\circ}$ coverage. uCO3D is significantly more diverse than MVImgNet and CO3Dv2, covering more than 1,000 object categories. It is also of higher quality, due to extensive quality checks of both the collected videos and the 3D annotations. Similar to analogous datasets, uCO3D contains annotations for 3D camera poses, depth maps and sparse point clouds. In addition, each object is equipped with a caption and a 3D Gaussian Splat reconstruction. We train several large 3D models on MVImgNet, CO3Dv2, and uCO3D and obtain superior results using the latter, showing that uCO3D is better for learning applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoRAG: Retrieval-Augmented Generation over Video Corpus</title>
      <itunes:episode>381</itunes:episode>
      <podcast:episode>381</podcast:episode>
      <itunes:title>VideoRAG: Retrieval-Augmented Generation over Video Corpus</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a0bf1d80-496d-4546-8b81-732c18240381</guid>
      <link>https://share.transistor.fm/s/dd556528</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            VideoRAG: Retrieval-Augmented Generation over Video Corpus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05874v1">http://arxiv.org/abs/2501.05874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.</p>
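
            <p><strong>Sketch:</strong><br>
            A bare-bones sketch of the retrieve-then-generate flow the abstract describes: embed the query and each candidate video, take the top-k most similar videos, then hand them to a generator together with the query. The embed_query, embed_video, and generate_answer callables stand in for LVLM calls and are hypothetical, not the paper's interfaces.</p>

            <pre><code>import numpy as np

def video_rag(query, video_corpus, embed_query, embed_video, generate_answer, k=3):
    q = embed_query(query)                                    # query embedding
    vid_embs = np.stack([embed_video(v) for v in video_corpus])

    # Cosine similarity between the query and every candidate video.
    sims = vid_embs @ q / (np.linalg.norm(vid_embs, axis=1) * np.linalg.norm(q) + 1e-8)
    top_k = np.argsort(-sims)[:k]                             # most relevant videos

    retrieved = [video_corpus[i] for i in top_k]
    # Visual content and any associated text of the retrieved videos are passed
    # to the generator alongside the query.
    return generate_answer(query, retrieved)</code></pre>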
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            VideoRAG: Retrieval-Augmented Generation over Video Corpus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05874v1">http://arxiv.org/abs/2501.05874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:30:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dd556528/394477b6.mp3" length="21506586" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 43 | cs.CV, cs.AI, cs.CL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang</p>

            <p><strong>Title:</strong><br>
            VideoRAG: Retrieval-Augmented Generation over Video Corpus</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05874v1">http://arxiv.org/abs/2501.05874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves relevant videos based on their relevance to queries but also utilizes both visual and textual information of videos in the output generation. Further, to operationalize this, our method revolves around the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?</title>
      <itunes:episode>380</itunes:episode>
      <podcast:episode>380</podcast:episode>
      <itunes:title>OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4bda0964-1f97-49a3-a6cb-5181649999cd</guid>
      <link>https://share.transistor.fm/s/64004fe6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05510v1">http://arxiv.org/abs/2501.05510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05510v1">http://arxiv.org/abs/2501.05510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:29:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/64004fe6/c3e70530.mp3" length="21614442" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05510v1">http://arxiv.org/abs/2501.05510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enabling Scalable Oversight via Self-Evolving Critic</title>
      <itunes:episode>379</itunes:episode>
      <podcast:episode>379</podcast:episode>
      <itunes:title>Enabling Scalable Oversight via Self-Evolving Critic</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a2edabfa-93e6-4410-85f1-326d833bb10a</guid>
      <link>https://share.transistor.fm/s/44fc19f3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Enabling Scalable Oversight via Self-Evolving Critic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05727v1">http://arxiv.org/abs/2501.05727v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT's performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Enabling Scalable Oversight via Self-Evolving Critic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05727v1">http://arxiv.org/abs/2501.05727v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT's performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:29:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/44fc19f3/d399039d.mp3" length="26787494" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1671</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Enabling Scalable Oversight via Self-Evolving Critic</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05727v1">http://arxiv.org/abs/2501.05727v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotations or more powerful models, leaving the issue of enhancing critique capabilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that enables genuine self-evolution of critique abilities. Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes. Implemented with Qwen2.5-72B-Instruct, one of the most powerful LLMs, SCRIT achieves up to a 10.3% improvement on critique-correction and error identification benchmarks. Our analysis reveals that SCRIT's performance scales positively with data and model size, outperforms alternative approaches, and benefits critically from its self-validation component.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models</title>
      <itunes:episode>378</itunes:episode>
      <podcast:episode>378</podcast:episode>
      <itunes:title>Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edba513d-2a4a-4c8c-a19d-a380e8b54d23</guid>
      <link>https://share.transistor.fm/s/2cebf848</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05767v2">http://arxiv.org/abs/2501.05767v2</a></p>

            <p><strong>Abstract:</strong><br>
            The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05767v2">http://arxiv.org/abs/2501.05767v2</a></p>

            <p><strong>Abstract:</strong><br>
            The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:29:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2cebf848/79b105ec.mp3" length="21148019" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CL, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun</p>

            <p><strong>Title:</strong><br>
            Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05767v2">http://arxiv.org/abs/2501.05767v2</a></p>

            <p><strong>Abstract:</strong><br>
            The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding</title>
      <itunes:episode>377</itunes:episode>
      <podcast:episode>377</podcast:episode>
      <itunes:title>ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0197b478-571a-4713-bc96-16942a944199</guid>
      <link>https://share.transistor.fm/s/5cbc346f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang</p>

            <p><strong>Title:</strong><br>
            ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05452v1">http://arxiv.org/abs/2501.05452v1</a></p>

            <p><strong>Abstract:</strong><br>
            Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang</p>

            <p><strong>Title:</strong><br>
            ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05452v1">http://arxiv.org/abs/2501.05452v1</a></p>

            <p><strong>Abstract:</strong><br>
            Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:28:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5cbc346f/9c22d6f2.mp3" length="21940867" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang</p>

            <p><strong>Title:</strong><br>
            ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05452v1">http://arxiv.org/abs/2501.05452v1</a></p>

            <p><strong>Abstract:</strong><br>
            Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning</title>
      <itunes:episode>376</itunes:episode>
      <podcast:episode>376</podcast:episode>
      <itunes:title>ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1431937b-0863-4717-ac17-5b922dfb4067</guid>
      <link>https://share.transistor.fm/s/682b68ef</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04698v1">http://arxiv.org/abs/2501.04698v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mixes attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04698v1">http://arxiv.org/abs/2501.04698v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mixes attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:28:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/682b68ef/0c5b35d7.mp3" length="22758003" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1419</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai</p>

            <p><strong>Title:</strong><br>
            ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04698v1">http://arxiv.org/abs/2501.04698v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mixes attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that represents and decouples various concepts well. To address these challenges, we introduce ConceptMaster, an innovative framework that effectively tackles the critical issues of identity decoupling while maintaining concept fidelity in customized videos. Specifically, we introduce a novel strategy of learning decoupled multi-concept embeddings that are injected into the diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To further overcome the scarcity of high-quality MCVC data, we carefully establish a data construction pipeline, which enables systematic collection of precise multi-concept video-entity data across diverse concepts. A comprehensive benchmark is designed to validate the effectiveness of our model from three critical dimensions: concept fidelity, identity decoupling ability, and video generation quality across six different concept composition scenarios. Extensive experiments demonstrate that our ConceptMaster significantly outperforms previous approaches for this task, paving the way for generating personalized and semantically accurate videos across multiple concepts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains</title>
      <itunes:episode>375</itunes:episode>
      <podcast:episode>375</podcast:episode>
      <itunes:title>Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e32cf39-89bf-4bb3-ab9b-e01f4ae5edf8</guid>
      <link>https://share.transistor.fm/s/3a33efd4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch</p>

            <p><strong>Title:</strong><br>
            Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05707v1">http://arxiv.org/abs/2501.05707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch</p>

            <p><strong>Title:</strong><br>
            Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05707v1">http://arxiv.org/abs/2501.05707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 13 Jan 2025 20:27:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3a33efd4/f8a257fb.mp3" length="21563439" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1344</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch</p>

            <p><strong>Title:</strong><br>
            Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05707v1">http://arxiv.org/abs/2501.05707v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The GAN is dead; long live the GAN! A Modern GAN Baseline</title>
      <itunes:episode>374</itunes:episode>
      <podcast:episode>374</podcast:episode>
      <itunes:title>The GAN is dead; long live the GAN! A Modern GAN Baseline</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7ed09254-bed9-42a0-9c8f-f9cd21791fcb</guid>
      <link>https://share.transistor.fm/s/26ef3290</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin</p>

            <p><strong>Title:</strong><br>
            The GAN is dead; long live the GAN! A Modern GAN Baseline</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05441v1">http://arxiv.org/abs/2501.05441v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is a widespread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin</p>

            <p><strong>Title:</strong><br>
            The GAN is dead; long live the GAN! A Modern GAN Baseline</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05441v1">http://arxiv.org/abs/2501.05441v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is a widespread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:04:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26ef3290/8e957b4a.mp3" length="19454405" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1212</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin</p>

            <p><strong>Title:</strong><br>
            The GAN is dead; long live the GAN! A Modern GAN Baseline</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05441v1">http://arxiv.org/abs/2501.05441v1</a></p>

            <p><strong>Abstract:</strong><br>
            There is a widespread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>An Empirical Study of Autoregressive Pre-training from Videos</title>
      <itunes:episode>373</itunes:episode>
      <podcast:episode>373</podcast:episode>
      <itunes:title>An Empirical Study of Autoregressive Pre-training from Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d38a552d-8837-4994-9a97-876bb6ea550c</guid>
      <link>https://share.transistor.fm/s/63c57bc0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of Autoregressive Pre-training from Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05453v1">http://arxiv.org/abs/2501.05453v1</a></p>

            <p><strong>Abstract:</strong><br>
            We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of Autoregressive Pre-training from Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05453v1">http://arxiv.org/abs/2501.05453v1</a></p>

            <p><strong>Abstract:</strong><br>
            We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:03:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/63c57bc0/3bb6ee9a.mp3" length="20974527" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            An Empirical Study of Autoregressive Pre-training from Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05453v1">http://arxiv.org/abs/2501.05453v1</a></p>

            <p><strong>Abstract:</strong><br>
            We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives</title>
      <itunes:episode>372</itunes:episode>
      <podcast:episode>372</podcast:episode>
      <itunes:title>Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1d2f26bc-4503-45a0-a87e-2f8208baadd6</guid>
      <link>https://share.transistor.fm/s/bbce7996</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04003v1">http://arxiv.org/abs/2501.04003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04003v1">http://arxiv.org/abs/2501.04003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:03:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bbce7996/eb293f27.mp3" length="20940302" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1305</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04003v1">http://arxiv.org/abs/2501.04003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Entropy-Guided Attention for Private LLMs</title>
      <itunes:episode>371</itunes:episode>
      <podcast:episode>371</podcast:episode>
      <itunes:title>Entropy-Guided Attention for Private LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">994c1887-b558-4ca0-ab4e-4d303342886d</guid>
      <link>https://share.transistor.fm/s/e72333c0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Nandan Kumar Jha, Brandon Reagen</p>

            <p><strong>Title:</strong><br>
            Entropy-Guided Attention for Private LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03489v2">http://arxiv.org/abs/2501.03489v2</a></p>

            <p><strong>Abstract:</strong><br>
            The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: <em>entropy collapse</em> in deeper layers that destabilizes training, and <em>entropic overload</em> in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Nandan Kumar Jha, Brandon Reagen</p>

            <p><strong>Title:</strong><br>
            Entropy-Guided Attention for Private LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03489v2">http://arxiv.org/abs/2501.03489v2</a></p>

            <p><strong>Abstract:</strong><br>
            The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: <em>entropy collapse</em> in deeper layers that destabilizes training, and <em>entropic overload</em> in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:02:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e72333c0/f85efa19.mp3" length="23265341" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1450</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.LG, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Nandan Kumar Jha, Brandon Reagen</p>

            <p><strong>Title:</strong><br>
            Entropy-Guided Attention for Private LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03489v2">http://arxiv.org/abs/2501.03489v2</a></p>

            <p><strong>Abstract:</strong><br>
            The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: <em>entropy collapse</em> in deeper layers that destabilizes training, and <em>entropic overload</em> in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis</title>
      <itunes:episode>370</itunes:episode>
      <podcast:episode>370</podcast:episode>
      <itunes:title>On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aee1c932-1271-4b0c-931b-a8d6ace039ac</guid>
      <link>https://share.transistor.fm/s/d09b0c1e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CC, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song</p>

            <p><strong>Title:</strong><br>
            On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04377v1">http://arxiv.org/abs/2501.04377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Visual Autoregressive (VAR) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. However, the state-of-the-art algorithm of VAR models in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes O(n^4) time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of VAR Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. Specifically, we establish a critical threshold for the norm of input matrices used in VAR attention mechanisms. Above this threshold, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, a sub-quartic time algorithm for VAR models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the VAR model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in VAR frameworks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CC, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song</p>

            <p><strong>Title:</strong><br>
            On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04377v1">http://arxiv.org/abs/2501.04377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Visual Autoregressive (VAR) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. However, the state-of-the-art algorithm of VAR models in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes O(n^4) time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of VAR Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. Specifically, we establish a critical threshold for the norm of input matrices used in VAR attention mechanisms. Above this threshold, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, a sub-quartic time algorithm for VAR models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the VAR model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in VAR frameworks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:02:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d09b0c1e/33ddbc0b.mp3" length="18531617" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1155</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CC, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song</p>

            <p><strong>Title:</strong><br>
            On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04377v1">http://arxiv.org/abs/2501.04377v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, Visual Autoregressive (VAR) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. However, the state-of-the-art algorithm of VAR models in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes O(n^4) time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of VAR Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. Specifically, we establish a critical threshold for the norm of input matrices used in VAR attention mechanisms. Above this threshold, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, a sub-quartic time algorithm for VAR models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the VAR model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in VAR frameworks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model</title>
      <itunes:episode>369</itunes:episode>
      <podcast:episode>369</podcast:episode>
      <itunes:title>Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cfe47d3f-fdf8-48ce-ab7e-ce6c2f4a501b</guid>
      <link>https://share.transistor.fm/s/c9cc208b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš</p>

            <p><strong>Title:</strong><br>
            Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05122v1">http://arxiv.org/abs/2501.05122v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš</p>

            <p><strong>Title:</strong><br>
            Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05122v1">http://arxiv.org/abs/2501.05122v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:02:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c9cc208b/419021cf.mp3" length="21318103" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš</p>

            <p><strong>Title:</strong><br>
            Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05122v1">http://arxiv.org/abs/2501.05122v1</a></p>

            <p><strong>Abstract:</strong><br>
            Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution</title>
      <itunes:episode>368</itunes:episode>
      <podcast:episode>368</podcast:episode>
      <itunes:title>SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">97f72386-3472-4718-ab60-7cdbbfe8840e</guid>
      <link>https://share.transistor.fm/s/87d11548</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen</p>

            <p><strong>Title:</strong><br>
            SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05040v1">http://arxiv.org/abs/2501.05040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source LLM designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight LLM to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses another LLM to generate patches for the identified files. Then, to mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches, and train the two modules of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models with scores of 23.3% and 30.2%, respectively. These outcomes highlight the efficacy of our approach. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen</p>

            <p><strong>Title:</strong><br>
            SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05040v1">http://arxiv.org/abs/2501.05040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source LLM designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight LLM to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses another LLM to generate patches for the identified files. Then, to mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches, and train the two modules of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models with scores of 23.3% and 30.2%, respectively. These outcomes highlight the efficacy of our approach. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:01:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/87d11548/82ae4217.mp3" length="20885111" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen</p>

            <p><strong>Title:</strong><br>
            SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.05040v1">http://arxiv.org/abs/2501.05040v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source LLM designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight LLM to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses another LLM to generate patches for the identified files. Then, to mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches, and train the two modules of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models with scores of 23.3% and 30.2%, respectively. These outcomes highlight the efficacy of our approach. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models</title>
      <itunes:episode>367</itunes:episode>
      <podcast:episode>367</podcast:episode>
      <itunes:title>Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c7770935-43be-4202-b434-984516cc2f39</guid>
      <link>https://share.transistor.fm/s/de76a8c6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir</p>

            <p><strong>Title:</strong><br>
            Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04828v1">http://arxiv.org/abs/2501.04828v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir</p>

            <p><strong>Title:</strong><br>
            Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04828v1">http://arxiv.org/abs/2501.04828v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 10 Jan 2025 21:01:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/de76a8c6/6967b243.mp3" length="25583816" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1595</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir</p>

            <p><strong>Title:</strong><br>
            Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04828v1">http://arxiv.org/abs/2501.04828v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking</title>
      <itunes:episode>366</itunes:episode>
      <podcast:episode>366</podcast:episode>
      <itunes:title>rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">847d0664-6955-4f8c-af0e-71ee95aa0114</guid>
      <link>https://share.transistor.fm/s/bcc41f54</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 116 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang</p>

            <p><strong>Title:</strong><br>
            rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04519v1">http://arxiv.org/abs/2501.04519v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 116 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang</p>

            <p><strong>Title:</strong><br>
            rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04519v1">http://arxiv.org/abs/2501.04519v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:47:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bcc41f54/28d36e39.mp3" length="26093292" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1627</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 116 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang</p>

            <p><strong>Title:</strong><br>
            rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04519v1">http://arxiv.org/abs/2501.04519v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought</title>
      <itunes:episode>365</itunes:episode>
      <podcast:episode>365</podcast:episode>
      <itunes:title>Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c8a667d-8943-4576-9783-9b0f8c2bf3ac</guid>
      <link>https://share.transistor.fm/s/b2b7d265</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn</p>

            <p><strong>Title:</strong><br>
            Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04682v1">http://arxiv.org/abs/2501.04682v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn</p>

            <p><strong>Title:</strong><br>
            Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04682v1">http://arxiv.org/abs/2501.04682v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:47:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2b7d265/9871580c.mp3" length="23959614" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 47 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn</p>

            <p><strong>Title:</strong><br>
            Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04682v1">http://arxiv.org/abs/2501.04682v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics</title>
      <itunes:episode>364</itunes:episode>
      <podcast:episode>364</podcast:episode>
      <itunes:title>URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7b570483-aeb8-4800-8de8-cbe92313ca37</guid>
      <link>https://share.transistor.fm/s/b4a2f635</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04686v1">http://arxiv.org/abs/2501.04686v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04686v1">http://arxiv.org/abs/2501.04686v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:46:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b4a2f635/4e7da2c3.mp3" length="22752968" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1418</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, Yujiu Yang</p>

            <p><strong>Title:</strong><br>
            URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04686v1">http://arxiv.org/abs/2501.04686v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) reasoning has been widely applied in the mathematical reasoning of Large Language Models (LLMs). Recently, the introduction of derivative process supervision on CoT trajectories has sparked discussions on enhancing scaling capabilities during test time, thereby boosting the potential of these models. However, in multimodal mathematical reasoning, the scarcity of high-quality CoT training data has hindered existing models from achieving high-precision CoT reasoning and has limited the realization of reasoning potential during test time. In this work, we propose a three-module synthesis strategy that integrates CoT distillation, trajectory-format rewriting, and format unification. It results in a high-quality CoT reasoning instruction fine-tuning dataset in multimodal mathematics, MMathCoT-1M. We comprehensively validate the state-of-the-art (SOTA) performance of the trained URSA-7B model on multiple multimodal mathematical benchmarks. For test-time scaling, we introduce a data synthesis strategy that automatically generates process annotation datasets, known as DualMath-1.1M, focusing on both interpretation and logic. By further training URSA-7B on DualMath-1.1M, we transition from CoT reasoning capabilities to robust supervision abilities. The trained URSA-RM-7B acts as a verifier, effectively enhancing the performance of URSA-7B at test time. URSA-RM-7B also demonstrates excellent out-of-distribution (OOD) verifying capabilities, showcasing its generalization. Model weights, training data and code will be open-sourced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Agent Laboratory: Using LLM Agents as Research Assistants</title>
      <itunes:episode>363</itunes:episode>
      <podcast:episode>363</podcast:episode>
      <itunes:title>Agent Laboratory: Using LLM Agents as Research Assistants</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">93eb910e-3b3c-48e1-8956-f1fe8d8faa23</guid>
      <link>https://share.transistor.fm/s/a53879d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum</p>

            <p><strong>Title:</strong><br>
            Agent Laboratory: Using LLM Agents as Research Assistants</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04227v1">http://arxiv.org/abs/2501.04227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing--to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum</p>

            <p><strong>Title:</strong><br>
            Agent Laboratory: Using LLM Agents as Research Assistants</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04227v1">http://arxiv.org/abs/2501.04227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing--to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:46:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a53879d8/7a2c8a10.mp3" length="22648449" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, Emad Barsoum</p>

            <p><strong>Title:</strong><br>
            Agent Laboratory: Using LLM Agents as Research Assistants</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04227v1">http://arxiv.org/abs/2501.04227v1</a></p>

            <p><strong>Abstract:</strong><br>
            Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages--literature review, experimentation, and report writing--to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) The generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) Human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLM4SR: A Survey on Large Language Models for Scientific Research</title>
      <itunes:episode>362</itunes:episode>
      <podcast:episode>362</podcast:episode>
      <itunes:title>LLM4SR: A Survey on Large Language Models for Scientific Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">448b8d66-847f-4ba9-912c-87df66583d62</guid>
      <link>https://share.transistor.fm/s/35b863e1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.DL</p>

            <p><strong>Authors:</strong><br>
            Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du</p>

            <p><strong>Title:</strong><br>
            LLM4SR: A Survey on Large Language Models for Scientific Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04306v1">http://arxiv.org/abs/2501.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.DL</p>

            <p><strong>Authors:</strong><br>
            Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du</p>

            <p><strong>Title:</strong><br>
            LLM4SR: A Survey on Large Language Models for Scientific Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04306v1">http://arxiv.org/abs/2501.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:46:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/35b863e1/1dfe6f64.mp3" length="24278916" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1514</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.DL</p>

            <p><strong>Authors:</strong><br>
            Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du</p>

            <p><strong>Title:</strong><br>
            LLM4SR: A Survey on Large Language Models for Scientific Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04306v1">http://arxiv.org/abs/2501.04306v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection</title>
      <itunes:episode>361</itunes:episode>
      <podcast:episode>361</podcast:episode>
      <itunes:title>InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">83d368e0-3ba5-4e51-b529-8c6d316dcbda</guid>
      <link>https://share.transistor.fm/s/d7d8e482</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu</p>

            <p><strong>Title:</strong><br>
            InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04575v1">http://arxiv.org/abs/2501.04575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu</p>

            <p><strong>Title:</strong><br>
            InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04575v1">http://arxiv.org/abs/2501.04575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:45:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d7d8e482/e7336842.mp3" length="20227239" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu</p>

            <p><strong>Title:</strong><br>
            InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04575v1">http://arxiv.org/abs/2501.04575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images</title>
      <itunes:episode>360</itunes:episode>
      <podcast:episode>360</podcast:episode>
      <itunes:title>SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3fffee6f-87ab-4cd0-b55e-5d858b375b7f</guid>
      <link>https://share.transistor.fm/s/133e3dfb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani</p>

            <p><strong>Title:</strong><br>
            SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04689v1">http://arxiv.org/abs/2501.04689v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: https://spar3d.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani</p>

            <p><strong>Title:</strong><br>
            SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04689v1">http://arxiv.org/abs/2501.04689v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: https://spar3d.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:45:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/133e3dfb/f31976dc.mp3" length="22118076" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M. Rehg, Varun Jampani</p>

            <p><strong>Title:</strong><br>
            SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04689v1">http://arxiv.org/abs/2501.04689v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study the problem of single-image 3D object reconstruction. Recent works have diverged into two directions: regression-based modeling and generative modeling. Regression methods efficiently infer visible surfaces, but struggle with occluded regions. Generative methods handle uncertain regions better by modeling distributions, but are computationally expensive and the generation is often misaligned with visible surfaces. In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. The first stage of SPAR3D generates sparse 3D point clouds using a lightweight point diffusion model, which has a fast sampling speed. The second stage uses both the sampled point cloud and the input image to create highly detailed meshes. Our two-stage design enables probabilistic modeling of the ill-posed single-image 3D task while maintaining high computational efficiency and great output fidelity. Using point clouds as an intermediate representation further allows for interactive user edits. Evaluated on diverse datasets, SPAR3D demonstrates superior performance over previous state-of-the-art methods, at an inference speed of 0.7 seconds. Project page with code and model: https://spar3d.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GeAR: Generation Augmented Retrieval</title>
      <itunes:episode>359</itunes:episode>
      <podcast:episode>359</podcast:episode>
      <itunes:title>GeAR: Generation Augmented Retrieval</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">45059662-2236-46ab-9a12-2178f9d5c32d</guid>
      <link>https://share.transistor.fm/s/16f8b169</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            GeAR: Generation Augmented Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02772v1">http://arxiv.org/abs/2501.02772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. However, such a scalar similarity struggles to convey sufficient information and impedes our comprehension of the retrieval results. In addition, this computational process mainly emphasizes the global semantics and ignores the fine-grained semantic relationship between the query and the complex text in the document. In this paper, we propose a new method called <strong>Ge</strong>neration <strong>A</strong>ugmented <strong>R</strong>etrieval (<strong>GeAR</strong>) that incorporates well-designed fusion and decoding modules. This enables GeAR to generate the relevant text from documents based on the fused representation of the query and the document, thus learning to "focus on" the fine-grained information. Also, when used as a retriever, GeAR does not add any computational burden over bi-encoders. To support the training of the new framework, we have introduced a pipeline to efficiently synthesize high-quality data by utilizing large language models. GeAR exhibits competitive retrieval and localization performance across diverse scenarios and datasets. Moreover, the qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released after completing technical review to facilitate future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            GeAR: Generation Augmented Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02772v1">http://arxiv.org/abs/2501.02772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. However, such a scalar similarity struggles to convey sufficient information and impedes our comprehension of the retrieval results. In addition, this computational process mainly emphasizes the global semantics and ignores the fine-grained semantic relationship between the query and the complex text in the document. In this paper, we propose a new method called <strong>Ge</strong>neration <strong>A</strong>ugmented <strong>R</strong>etrieval (<strong>GeAR</strong>) that incorporates well-designed fusion and decoding modules. This enables GeAR to generate the relevant text from documents based on the fused representation of the query and the document, thus learning to "focus on" the fine-grained information. Also, when used as a retriever, GeAR does not add any computational burden over bi-encoders. To support the training of the new framework, we have introduced a pipeline to efficiently synthesize high-quality data by utilizing large language models. GeAR exhibits competitive retrieval and localization performance across diverse scenarios and datasets. Moreover, the qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released after completing technical review to facilitate future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:44:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16f8b169/7f744754.mp3" length="21342724" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.IR, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            GeAR: Generation Augmented Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02772v1">http://arxiv.org/abs/2501.02772v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity. However, such a scalar similarity struggles to convey sufficient information and impedes our comprehension of the retrieval results. In addition, this computational process mainly emphasizes the global semantics and ignores the fine-grained semantic relationship between the query and the complex text in the document. In this paper, we propose a new method called <strong>Ge</strong>neration <strong>A</strong>ugmented <strong>R</strong>etrieval (<strong>GeAR</strong>) that incorporates well-designed fusion and decoding modules. This enables GeAR to generate the relevant text from documents based on the fused representation of the query and the document, thus learning to "focus on" the fine-grained information. Also, when used as a retriever, GeAR does not add any computational burden over bi-encoders. To support the training of the new framework, we have introduced a pipeline to efficiently synthesize high-quality data by utilizing large language models. GeAR exhibits competitive retrieval and localization performance across diverse scenarios and datasets. Moreover, the qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released after completing technical review to facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation</title>
      <itunes:episode>358</itunes:episode>
      <podcast:episode>358</podcast:episode>
      <itunes:title>Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">809a8fc6-245b-48cc-bb69-d5bc1fbee669</guid>
      <link>https://share.transistor.fm/s/30db3dba</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu</p>

            <p><strong>Title:</strong><br>
            Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04144v1">http://arxiv.org/abs/2501.04144v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at https://github.com/kamwoh/chirpy3d.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu</p>

            <p><strong>Title:</strong><br>
            Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04144v1">http://arxiv.org/abs/2501.04144v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at https://github.com/kamwoh/chirpy3d.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:44:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/30db3dba/f95c91f1.mp3" length="23097763" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1440</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu</p>

            <p><strong>Title:</strong><br>
            Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04144v1">http://arxiv.org/abs/2501.04144v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we push the boundaries of fine-grained 3D generation into truly creative territory. Current methods either lack intricate details or simply mimic existing objects -- we enable both. By lifting 2D fine-grained understanding into 3D through multi-view diffusion and modeling part latents as continuous distributions, we unlock the ability to generate entirely new, yet plausible parts through interpolation and sampling. A self-supervised feature consistency loss further ensures stable generation of these unseen parts. The result is the first system capable of creating novel 3D objects with species-specific details that transcend existing examples. While we demonstrate our approach on birds, the underlying framework extends beyond things that can chirp! Code will be released at https://github.com/kamwoh/chirpy3d.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization</title>
      <itunes:episode>357</itunes:episode>
      <podcast:episode>357</podcast:episode>
      <itunes:title>DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">344ea509-16cd-428e-b1b8-3157e8bc5635</guid>
      <link>https://share.transistor.fm/s/3817ffd1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL, 68T45</p>

            <p><strong>Authors:</strong><br>
            Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha</p>

            <p><strong>Title:</strong><br>
            DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03271v2">http://arxiv.org/abs/2501.03271v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL, 68T45</p>

            <p><strong>Authors:</strong><br>
            Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha</p>

            <p><strong>Title:</strong><br>
            DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03271v2">http://arxiv.org/abs/2501.03271v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 09 Jan 2025 20:44:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3817ffd1/4c48df74.mp3" length="21748223" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1356</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CL, 68T45</p>

            <p><strong>Authors:</strong><br>
            Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha</p>

            <p><strong>Title:</strong><br>
            DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03271v2">http://arxiv.org/abs/2501.03271v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models</title>
      <itunes:episode>356</itunes:episode>
      <podcast:episode>356</podcast:episode>
      <itunes:title>REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c9c6f350-fbb3-4bfa-9ebc-cf631c37e25c</guid>
      <link>https://share.transistor.fm/s/d3e50e84</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jian Hu</p>

            <p><strong>Title:</strong><br>
            REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03262v1">http://arxiv.org/abs/2501.03262v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jian Hu</p>

            <p><strong>Title:</strong><br>
            REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03262v1">http://arxiv.org/abs/2501.03262v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:52:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3e50e84/50712bfa.mp3" length="20949885" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1306</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 51 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jian Hu</p>

            <p><strong>Title:</strong><br>
            REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03262v1">http://arxiv.org/abs/2501.03262v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical approach for aligning large language models with human preferences, witnessing rapid algorithmic evolution through methods such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO). We present REINFORCE++, an enhanced variant of the classical REINFORCE algorithm that incorporates key optimization techniques from PPO while eliminating the need for a critic network. REINFORCE++ achieves three primary objectives: (1) simplicity, (2) enhanced training stability, and (3) reduced computational overhead. Through extensive empirical evaluation, we demonstrate that REINFORCE++ exhibits superior stability compared to GRPO and achieves greater computational efficiency than PPO while maintaining comparable performance. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models</title>
      <itunes:episode>355</itunes:episode>
      <podcast:episode>355</podcast:episode>
      <itunes:title>MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">84e66465-3622-4c15-aee6-beaf02b0c851</guid>
      <link>https://share.transistor.fm/s/b84eb4fd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02955v1">http://arxiv.org/abs/2501.02955v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of an LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02955v1">http://arxiv.org/abs/2501.02955v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:51:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b84eb4fd/d4cb9748.mp3" length="21721047" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang</p>

            <p><strong>Title:</strong><br>
            MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02955v1">http://arxiv.org/abs/2501.02955v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cosmos World Foundation Model Platform for Physical AI</title>
      <itunes:episode>354</itunes:episode>
      <podcast:episode>354</podcast:episode>
      <itunes:title>Cosmos World Foundation Model Platform for Physical AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7db6edb4-9a91-4939-89c3-2a90123f8574</guid>
      <link>https://share.transistor.fm/s/7e783862</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski</p>

            <p><strong>Title:</strong><br>
            Cosmos World Foundation Model Platform for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03575v1">http://arxiv.org/abs/2501.03575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski</p>

            <p><strong>Title:</strong><br>
            Cosmos World Foundation Model Platform for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03575v1">http://arxiv.org/abs/2501.03575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:51:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e783862/4b5f5583.mp3" length="24669697" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1538</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski</p>

            <p><strong>Title:</strong><br>
            Cosmos World Foundation Model Platform for Physical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03575v1">http://arxiv.org/abs/2501.03575v1</a></p>

            <p><strong>Abstract:</strong><br>
            Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token</title>
      <itunes:episode>353</itunes:episode>
      <podcast:episode>353</podcast:episode>
      <itunes:title>LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ee1ed5a9-aa27-4043-b160-3164fc7d8959</guid>
      <link>https://share.transistor.fm/s/c374c9b1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng</p>

            <p><strong>Title:</strong><br>
            LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03895v1">http://arxiv.org/abs/2501.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.</p>
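
            <p><strong>Code sketch:</strong><br>
            As a rough, self-contained illustration of the pre-fusion idea summarized above (fusing visual information into text tokens before the LLM backbone, then collapsing the vision sequence to a single token), the sketch below uses a generic cross-attention layer and average pooling. Module names and sizes are invented for illustration; this is not the released LLaVA-Mini code.</p>

            <pre><code>import torch
import torch.nn as nn

class ToyPreFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(1)  # stand-in for learned compression

    def forward(self, text_tokens, vision_tokens):
        # Text tokens cross-attend to vision tokens ("modality pre-fusion"),
        # then the vision sequence is pooled down to a single token.
        fused_text, _ = self.xattn(text_tokens, vision_tokens, vision_tokens)
        one_vision = self.pool(vision_tokens.transpose(1, 2)).transpose(1, 2)
        return torch.cat([one_vision, text_tokens + fused_text], dim=1)

# toy shapes: batch 2, 576 vision tokens and 32 text tokens of width 512
out = ToyPreFusion()(torch.randn(2, 32, 512), torch.randn(2, 576, 512))
print(out.shape)  # torch.Size([2, 33, 512])
</code></pre>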
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng</p>

            <p><strong>Title:</strong><br>
            LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03895v1">http://arxiv.org/abs/2501.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:51:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c374c9b1/2988b7be.mp3" length="21035989" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1311</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng</p>

            <p><strong>Title:</strong><br>
            LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03895v1">http://arxiv.org/abs/2501.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos</title>
      <itunes:episode>352</itunes:episode>
      <podcast:episode>352</podcast:episode>
      <itunes:title>Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">97c26eee-3d1c-47bd-8863-56398acb6292</guid>
      <link>https://share.transistor.fm/s/7813083b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang</p>

            <p><strong>Title:</strong><br>
            Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04001v1">http://arxiv.org/abs/2501.04001v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang</p>

            <p><strong>Title:</strong><br>
            Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04001v1">http://arxiv.org/abs/2501.04001v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:50:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7813083b/f975e3ad.mp3" length="21834293" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1361</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang</p>

            <p><strong>Title:</strong><br>
            Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04001v1">http://arxiv.org/abs/2501.04001v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control</title>
      <itunes:episode>351</itunes:episode>
      <podcast:episode>351</podcast:episode>
      <itunes:title>Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8cdcc419-3023-4810-99ea-62349cc33170</guid>
      <link>https://share.transistor.fm/s/6d2c1227</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu</p>

            <p><strong>Title:</strong><br>
            Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03847v1">http://arxiv.org/abs/2501.03847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.</p>
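
            <p><strong>Code sketch:</strong><br>
            To make the conditioning idea concrete, the sketch below shows one generic way a 3D tracking video could be encoded and injected into a video diffusion backbone as an additive signal on the noisy latents. The architecture and tensor shapes are invented for illustration and are not the DaS implementation.</p>

            <pre><code>import torch
import torch.nn as nn

class ToyTrackingControl(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        # Small 3D conv encoder for the tracking video (assumed RGB-coded).
        self.encode = nn.Sequential(
            nn.Conv3d(3, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, latent_channels, 3, padding=1),
        )

    def forward(self, noisy_latents, tracking_video):
        # Inject the 3D-aware motion signal additively into the latents.
        return noisy_latents + self.encode(tracking_video)

# toy tensors: (batch, channels, frames, height, width)
ctrl = ToyTrackingControl()
out = ctrl(torch.randn(1, 4, 8, 32, 32), torch.randn(1, 3, 8, 32, 32))
print(out.shape)
</code></pre>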
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu</p>

            <p><strong>Title:</strong><br>
            Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03847v1">http://arxiv.org/abs/2501.03847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:50:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6d2c1227/cf638727.mp3" length="22379311" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu</p>

            <p><strong>Title:</strong><br>
            Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03847v1">http://arxiv.org/abs/2501.03847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis</title>
      <itunes:episode>350</itunes:episode>
      <podcast:episode>350</podcast:episode>
      <itunes:title>OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">662c98c0-727d-4261-89cc-9cc95a543fe9</guid>
      <link>https://share.transistor.fm/s/08e7e16b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang</p>

            <p><strong>Title:</strong><br>
            OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04561v1">http://arxiv.org/abs/2501.04561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang</p>

            <p><strong>Title:</strong><br>
            OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04561v1">http://arxiv.org/abs/2501.04561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:50:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/08e7e16b/504e1407.mp3" length="19799719" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang</p>

            <p><strong>Title:</strong><br>
            OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.04561v1">http://arxiv.org/abs/2501.04561v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose openomni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that openomni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides</title>
      <itunes:episode>349</itunes:episode>
      <podcast:episode>349</podcast:episode>
      <itunes:title>PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">76557c71-661d-4b66-ac02-83634184a58e</guid>
      <link>https://share.transistor.fm/s/d80f7e29</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun</p>

            <p><strong>Title:</strong><br>
            PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03936v1">http://arxiv.org/abs/2501.03936v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun</p>

            <p><strong>Title:</strong><br>
            PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03936v1">http://arxiv.org/abs/2501.03936v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:49:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d80f7e29/33203fb9.mp3" length="21324369" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun</p>

            <p><strong>Title:</strong><br>
            PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03936v1">http://arxiv.org/abs/2501.03936v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, often overlooking visual design and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to understand their structural patterns and content schemas, then drafts outlines and generates slides through code actions to ensure consistency and alignment. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Experiments show that PPTAgent significantly outperforms traditional automatic presentation generation methods across all three dimensions. The code and data are available at https://github.com/icip-cas/PPTAgent.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model</title>
      <itunes:episode>348</itunes:episode>
      <podcast:episode>348</podcast:episode>
      <itunes:title>Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43e79df3-3052-4d0f-a8d4-cbe997f117e3</guid>
      <link>https://share.transistor.fm/s/46e239be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou</p>

            <p><strong>Title:</strong><br>
            Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02790v1">http://arxiv.org/abs/2501.02790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.</p>
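
            <p><strong>Code sketch:</strong><br>
            As a toy illustration of turning segment-level rewards into the per-token signal an RL trainer consumes, the sketch below whitens segment rewards and copies each one to the tokens it covers. This stands in for, and is much simpler than, the paper's location-aware normalizer functions and reward interpolation.</p>

            <pre><code>import torch

def densify_segment_rewards(segment_rewards, segment_lengths):
    # Whiten rewards across segments (a crude stand-in for location-aware
    # normalizers), then copy each segment's reward to every token it covers.
    r = torch.tensor(segment_rewards, dtype=torch.float32)
    r = (r - r.mean()) / (r.std() + 1e-8)
    per_token = [r[i].expand(n) for i, n in enumerate(segment_lengths)]
    return torch.cat(per_token)

# three segments covering 4, 2, and 5 tokens of one sampled response
print(densify_segment_rewards([0.3, -1.0, 0.8], [4, 2, 5]))
</code></pre>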
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou</p>

            <p><strong>Title:</strong><br>
            Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02790v1">http://arxiv.org/abs/2501.02790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:49:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/46e239be/efc7a84b.mp3" length="21729378" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou</p>

            <p><strong>Title:</strong><br>
            Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02790v1">http://arxiv.org/abs/2501.02790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preference. Prior RLHF works typically take a bandit formulation, which, though intuitive, ignores the sequential nature of LM generation and can suffer from the sparse reward issue. While recent works propose dense token-level RLHF, treating each token as an action may be oversubtle to proper reward assignment. In this paper, we seek to get the best of both by training and utilizing a segment-level reward model, which assigns a reward to each semantically complete text segment that spans over a short sequence of tokens. For reward learning, our method allows dynamic text segmentation and compatibility with standard sequence-preference datasets. For effective RL-based LM training against segment reward, we generalize the classical scalar bandit reward normalizers into location-aware normalizer functions and interpolate the segment reward for further densification. With these designs, our method performs competitively on three popular RLHF benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation studies are conducted to further demonstrate our method.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting</title>
      <itunes:episode>347</itunes:episode>
      <podcast:episode>347</podcast:episode>
      <itunes:title>MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">24ef8211-5aa5-451a-8345-71faa7e415b6</guid>
      <link>https://share.transistor.fm/s/faf449a5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim</p>

            <p><strong>Title:</strong><br>
            MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03714v1">http://arxiv.org/abs/2501.03714v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim</p>

            <p><strong>Title:</strong><br>
            MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03714v1">http://arxiv.org/abs/2501.03714v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 08 Jan 2025 21:48:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/faf449a5/08c8a088.mp3" length="20050061" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1249</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangwoon Kwak, Joonsoo Kim, Jun Young Jeong, Won-Sik Cheong, Jihyong Oh, Munchurl Kim</p>

            <p><strong>Title:</strong><br>
            MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03714v1">http://arxiv.org/abs/2501.03714v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting (3DGS) has made significant strides in scene representation and neural rendering, with intense efforts focused on adapting it for dynamic scenes. Despite delivering remarkable rendering quality and speed, existing methods struggle with storage demands and representing complex real-world motions. To tackle these issues, we propose MoDec-GS, a memory-efficient Gaussian splatting framework designed for reconstructing novel views in challenging scenarios with complex motions. We introduce Global-to-Local Motion Decomposition (GLMD) to effectively capture dynamic motions in a coarse-to-fine manner. This approach leverages Global Canonical Scaffolds (Global CS) and Local Canonical Scaffolds (Local CS), extending static Scaffold representation to dynamic video reconstruction. For Global CS, we propose Global Anchor Deformation (GAD) to efficiently represent global dynamics along complex motions, by directly deforming the implicit Scaffold attributes which are anchor position, offset, and local context features. Next, we finely adjust local motions via the Local Gaussian Deformation (LGD) of Local CS explicitly. Additionally, we introduce Temporal Interval Adjustment (TIA) to automatically control the temporal coverage of each Local CS during training, allowing MoDec-GS to find optimal interval assignments based on the specified number of temporal segments. Extensive evaluations demonstrate that MoDec-GS achieves an average 70% reduction in model size over state-of-the-art methods for dynamic 3D Gaussians from real-world dynamic videos while maintaining or even improving rendering quality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution</title>
      <itunes:episode>346</itunes:episode>
      <podcast:episode>346</podcast:episode>
      <itunes:title>STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c2ba1d3a-f789-47bf-823c-0b48b731f829</guid>
      <link>https://share.transistor.fm/s/f1cbf750</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02976v1">http://arxiv.org/abs/2501.02976v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02976v1">http://arxiv.org/abs/2501.02976v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:42:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f1cbf750/9764e89a.mp3" length="21464831" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02976v1">http://arxiv.org/abs/2501.02976v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction</title>
      <itunes:episode>345</itunes:episode>
      <podcast:episode>345</podcast:episode>
      <itunes:title>Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">39a140bc-6612-4b36-8867-8f82415be3b6</guid>
      <link>https://share.transistor.fm/s/781a2fdd</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03218v1">http://arxiv.org/abs/2501.03218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at <a href="https://github.com/Mark12Ding/Dispider">https://github.com/Mark12Ding/Dispider</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03218v1">http://arxiv.org/abs/2501.03218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at <a href="https://github.com/Mark12Ding/Dispider">https://github.com/Mark12Ding/Dispider</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:41:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/781a2fdd/6380072f.mp3" length="25890617" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1614</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03218v1">http://arxiv.org/abs/2501.03218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at <a href="https://github.com/Mark12Ding/Dispider">https://github.com/Mark12Ding/Dispider</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning</title>
      <itunes:episode>344</itunes:episode>
      <podcast:episode>344</podcast:episode>
      <itunes:title>BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">304d33d4-3142-43af-a4cf-bb4f0a3102c4</guid>
      <link>https://share.transistor.fm/s/ca81e991</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03226v1">http://arxiv.org/abs/2501.03226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at the question level sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel 'first-try' strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain when combined with MCTS.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03226v1">http://arxiv.org/abs/2501.03226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at the question level sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel 'first-try' strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain when combined with MCTS.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:41:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ca81e991/1062036e.mp3" length="21601090" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1346</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang</p>

            <p><strong>Title:</strong><br>
            BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03226v1">http://arxiv.org/abs/2501.03226v1</a></p>

            <p><strong>Abstract:</strong><br>
            Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at the question level sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel 'first-try' strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain when combined with MCTS.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Personalized Graph-Based Retrieval for Large Language Models</title>
      <itunes:episode>343</itunes:episode>
      <podcast:episode>343</podcast:episode>
      <itunes:title>Personalized Graph-Based Retrieval for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d8442430-1ec8-469e-915f-6c76400d964e</guid>
      <link>https://share.transistor.fm/s/e36d9447</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed</p>

            <p><strong>Title:</strong><br>
            Personalized Graph-Based Retrieval for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02157v1">http://arxiv.org/abs/2501.02157v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed</p>

            <p><strong>Title:</strong><br>
            Personalized Graph-Based Retrieval for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02157v1">http://arxiv.org/abs/2501.02157v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:40:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e36d9447/39b35be2.mp3" length="20470885" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1276</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed</p>

            <p><strong>Title:</strong><br>
            Personalized Graph-Based Retrieval for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02157v1">http://arxiv.org/abs/2501.02157v1</a></p>

            <p><strong>Abstract:</strong><br>
            As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring</title>
      <itunes:episode>342</itunes:episode>
      <podcast:episode>342</podcast:episode>
      <itunes:title>METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8d409b2-eb04-47b9-b0e0-f433a21f948d</guid>
      <link>https://share.transistor.fm/s/53df4826</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | q-bio.GN, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02045v1">http://arxiv.org/abs/2501.02045v1</a></p>

            <p><strong>Abstract:</strong><br>
            We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | q-bio.GN, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02045v1">http://arxiv.org/abs/2501.02045v1</a></p>

            <p><strong>Abstract:</strong><br>
            We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:40:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/53df4826/020e2c33.mp3" length="20821139" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1298</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | q-bio.GN, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02045v1">http://arxiv.org/abs/2501.02045v1</a></p>

            <p><strong>Abstract:</strong><br>
            We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking</title>
      <itunes:episode>341</itunes:episode>
      <podcast:episode>341</podcast:episode>
      <itunes:title>GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8013f128-6adb-4a6c-b873-261a3e99e87c</guid>
      <link>https://share.transistor.fm/s/06fd17fb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02690v1">http://arxiv.org/abs/2501.02690v1</a></p>

            <p><strong>Abstract:</strong><br>
            4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed GS-DiT. To boost the training of GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02690v1">http://arxiv.org/abs/2501.02690v1</a></p>

            <p><strong>Abstract:</strong><br>
            4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed GS-DiT. To boost the training of GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:40:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/06fd17fb/12fc1ceb.mp3" length="21574762" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1345</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.02690v1">http://arxiv.org/abs/2501.02690v1</a></p>

            <p><strong>Abstract:</strong><br>
            4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed GS-DiT. To boost the training of GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation</title>
      <itunes:episode>340</itunes:episode>
      <podcast:episode>340</podcast:episode>
      <itunes:title>Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">45414009-157d-4f9d-83d5-e9f742030320</guid>
      <link>https://share.transistor.fm/s/9ce6e928</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak</p>

            <p><strong>Title:</strong><br>
            Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03059v1">http://arxiv.org/abs/2501.03059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak</p>

            <p><strong>Title:</strong><br>
            Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03059v1">http://arxiv.org/abs/2501.03059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:39:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ce6e928/8e884cfd.mp3" length="21412147" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Guy Yariv, Yuval Kirstain, Amit Zohar, Shelly Sheynin, Yaniv Taigman, Yossi Adi, Sagie Benaim, Adam Polyak</p>

            <p><strong>Title:</strong><br>
            Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03059v1">http://arxiv.org/abs/2501.03059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. Our key innovation is the introduction of a mask-based motion trajectory as an intermediate representation that captures both semantic object information and motion, enabling an expressive but compact representation of motion and semantics. To incorporate the learned representation in the second stage, we utilize object-level attention objectives. Specifically, we consider a spatial, per-object, masked cross-attention objective, integrating object-specific prompts into corresponding latent space regions, and a masked spatio-temporal self-attention objective, ensuring frame-to-frame consistency for each object. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art results in temporal coherence, motion realism, and text-prompt faithfulness. Additionally, we introduce a new challenging benchmark for single-object and multi-object I2V generation, and demonstrate our method's superiority on this benchmark. Project page is available at https://guyyariv.github.io/TTM/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TransPixar: Advancing Text-to-Video Generation with Transparency</title>
      <itunes:episode>339</itunes:episode>
      <podcast:episode>339</podcast:episode>
      <itunes:title>TransPixar: Advancing Text-to-Video Generation with Transparency</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b609de0a-fc87-4185-93cc-07e213d4dabc</guid>
      <link>https://share.transistor.fm/s/5b326add</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            TransPixar: Advancing Text-to-Video Generation with Transparency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03006v1">http://arxiv.org/abs/2501.03006v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.</p>
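
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A minimal LoRA-style adapter on a frozen linear projection, the general mechanism the abstract refers to for adding alpha-channel capacity without disturbing the pretrained RGB weights. The rank, scaling, and placement are assumptions for illustration only.</p>

            <pre><code>import numpy as np

class LoRALinear:
    def __init__(self, W_frozen, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W_frozen.shape
        self.W = W_frozen                                # frozen pretrained weight
        self.A = rng.normal(0.0, 0.02, (rank, d_in))     # trainable down-projection
        self.B = np.zeros((d_out, rank))                 # trainable up-projection (zero init)
        self.scale = alpha / rank                        # adapter is a no-op at initialization

    def __call__(self, x):
        # Base projection plus the low-rank update; only A and B would be trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
</code></pre>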
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            TransPixar: Advancing Text-to-Video Generation with Transparency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03006v1">http://arxiv.org/abs/2501.03006v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:39:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5b326add/ebfdb320.mp3" length="21893622" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen</p>

            <p><strong>Title:</strong><br>
            TransPixar: Advancing Text-to-Video Generation with Transparency</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.03006v1">http://arxiv.org/abs/2501.03006v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AutoPresent: Designing Structured Visuals from Scratch</title>
      <itunes:episode>338</itunes:episode>
      <podcast:episode>338</podcast:episode>
      <itunes:title>AutoPresent: Designing Structured Visuals from Scratch</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">022f975a-38eb-4b87-b10d-3bf904b50cae</guid>
      <link>https://share.transistor.fm/s/b8cd94f6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell</p>

            <p><strong>Title:</strong><br>
            AutoPresent: Designing Structured Visuals from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00912v1">http://arxiv.org/abs/2501.00912v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing structured visuals such as presentation slides is essential for effective communication, requiring both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked with refining its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell</p>

            <p><strong>Title:</strong><br>
            AutoPresent: Designing Structured Visuals from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00912v1">http://arxiv.org/abs/2501.00912v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing structured visuals such as presentation slides is essential for effective communication, requiring both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked with refining its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 07 Jan 2025 21:38:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b8cd94f6/5fa6a087.mp3" length="18612214" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1160</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell</p>

            <p><strong>Title:</strong><br>
            AutoPresent: Designing Structured Visuals from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00912v1">http://arxiv.org/abs/2501.00912v1</a></p>

            <p><strong>Abstract:</strong><br>
            Designing structured visuals such as presentation slides is essential for effective communication, requiring both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation, with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked with refining its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation</title>
      <itunes:episode>337</itunes:episode>
      <podcast:episode>337</podcast:episode>
      <itunes:title>EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">51cb34cd-08bb-4194-8dd1-e1dd97530678</guid>
      <link>https://share.transistor.fm/s/5f0140db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren</p>

            <p><strong>Title:</strong><br>
            EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01895v1">http://arxiv.org/abs/2501.01895v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren</p>

            <p><strong>Title:</strong><br>
            EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01895v1">http://arxiv.org/abs/2501.01895v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:15:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5f0140db/08097e0a.mp3" length="23802866" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1484</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 41 | cs.RO, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren</p>

            <p><strong>Title:</strong><br>
            EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01895v1">http://arxiv.org/abs/2501.01895v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce EnerVerse, a comprehensive framework for embodied future space generation specifically designed for robotic manipulation tasks. EnerVerse seamlessly integrates convolutional and bidirectional attention mechanisms for inner-chunk space modeling, ensuring low-level consistency and continuity. Recognizing the inherent redundancy in video data, we propose a sparse memory context combined with a chunkwise unidirectional generative paradigm to enable the generation of infinitely long sequences. To further augment robotic capabilities, we introduce the Free Anchor View (FAV) space, which provides flexible perspectives to enhance observation and analysis. The FAV space mitigates motion modeling ambiguity, removes physical constraints in confined environments, and significantly improves the robot's generalization and adaptability across various tasks and settings. To address the prohibitive costs and labor intensity of acquiring multi-camera observations, we present a data engine pipeline that integrates a generative model with 4D Gaussian Splatting (4DGS). This pipeline leverages the generative model's robust generalization capabilities and the spatial constraints provided by 4DGS, enabling an iterative enhancement of data quality and diversity, thus creating a data flywheel effect that effectively narrows the sim-to-real gap. Finally, our experiments demonstrate that the embodied future space generation prior substantially enhances policy predictive capabilities, resulting in improved overall performance, particularly in long-range robotic manipulation tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction</title>
      <itunes:episode>336</itunes:episode>
      <podcast:episode>336</podcast:episode>
      <itunes:title>VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ce130a49-584c-46a8-931c-bca4cbb4962a</guid>
      <link>https://share.transistor.fm/s/6b682c52</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01957v1">http://arxiv.org/abs/2501.01957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01957v1">http://arxiv.org/abs/2501.01957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:15:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b682c52/0826333e.mp3" length="19843538" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1237</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CV, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Heting Gao, Ke Li, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01957v1">http://arxiv.org/abs/2501.01957v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation</title>
      <itunes:episode>335</itunes:episode>
      <podcast:episode>335</podcast:episode>
      <itunes:title>VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">41cb3a99-9bdf-43aa-bac0-e2c270ff58d5</guid>
      <link>https://share.transistor.fm/s/6e6950e9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21059v1">http://arxiv.org/abs/2412.21059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a general strategy for aligning visual generation models -- both image and video generation -- with human preferences. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.</p>
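
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract describes decomposing preference into judgment questions whose answers are linearly weighted and summed into one score. A toy version of that scoring step is sketched below; the questions and weights are invented for illustration.</p>

            <pre><code>def vision_reward_score(answers, weights):
    """answers: dict mapping question to 1.0 (yes) or 0.0 (no);
    weights: dict mapping question to a learned linear weight."""
    return sum(weights[q] * answers[q] for q in weights)

example_answers = {"is the subject sharp": 1.0, "does the motion look natural": 0.0}
example_weights = {"is the subject sharp": 0.6, "does the motion look natural": 1.4}
print(vision_reward_score(example_answers, example_weights))  # 0.6
</code></pre>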
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21059v1">http://arxiv.org/abs/2412.21059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a general strategy for aligning visual generation models -- both image and video generation -- with human preferences. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:14:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6e6950e9/7c17a96c.mp3" length="22171184" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21059v1">http://arxiv.org/abs/2412.21059v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a general strategy for aligning visual generation models -- both image and video generation -- with human preferences. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Virgo: A Preliminary Exploration on Reproducing o1-like MLLM</title>
      <itunes:episode>334</itunes:episode>
      <podcast:episode>334</podcast:episode>
      <itunes:title>Virgo: A Preliminary Exploration on Reproducing o1-like MLLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">379a6697-9281-435b-bb30-85af05e43ad2</guid>
      <link>https://share.transistor.fm/s/622add6e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Virgo: A Preliminary Exploration on Reproducing o1-like MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01904v1">http://arxiv.org/abs/2501.01904v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems.   To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Virgo: A Preliminary Exploration on Reproducing o1-like MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01904v1">http://arxiv.org/abs/2501.01904v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems.   To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:14:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/622add6e/35735377.mp3" length="21780769" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            Virgo: A Preliminary Exploration on Reproducing o1-like MLLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01904v1">http://arxiv.org/abs/2501.01904v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems.   To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SDPO: Segment-Level Direct Preference Optimization for Social Agents</title>
      <itunes:episode>333</itunes:episode>
      <podcast:episode>333</podcast:episode>
      <itunes:title>SDPO: Segment-Level Direct Preference Optimization for Social Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d0a1c652-c9d7-4602-8b26-8a7081359833</guid>
      <link>https://share.transistor.fm/s/d059bfb2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang</p>

            <p><strong>Title:</strong><br>
            SDPO: Segment-Level Direct Preference Optimization for Social Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01821v1">http://arxiv.org/abs/2501.01821v1</a></p>

            <p><strong>Abstract:</strong><br>
            Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.</p>
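
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            A hedged sketch of what a segment-level variant of the DPO loss could look like: the standard DPO objective, but with per-token log-probabilities summed only over a chosen key segment rather than a single turn or the whole session. Tensor names and the masking scheme are assumptions.</p>

            <pre><code>import torch
import torch.nn.functional as F

def segment_dpo_loss(logp_policy_w, logp_ref_w, logp_policy_l, logp_ref_l,
                     seg_mask_w, seg_mask_l, beta=0.1):
    """logp_*: per-token log-probs of shape (batch, seq); seg_mask_*: 1.0 on
    tokens inside the preferred/dispreferred key segment, 0.0 elsewhere."""
    delta_w = ((logp_policy_w - logp_ref_w) * seg_mask_w).sum(-1)
    delta_l = ((logp_policy_l - logp_ref_l) * seg_mask_l).sum(-1)
    return -F.logsigmoid(beta * (delta_w - delta_l)).mean()
</code></pre>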
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang</p>

            <p><strong>Title:</strong><br>
            SDPO: Segment-Level Direct Preference Optimization for Social Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01821v1">http://arxiv.org/abs/2501.01821v1</a></p>

            <p><strong>Abstract:</strong><br>
            Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:14:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d059bfb2/079167ae.mp3" length="19008036" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1184</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang</p>

            <p><strong>Title:</strong><br>
            SDPO: Segment-Level Direct Preference Optimization for Social Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01821v1">http://arxiv.org/abs/2501.01821v1</a></p>

            <p><strong>Abstract:</strong><br>
            Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Graph Generative Pre-trained Transformer</title>
      <itunes:episode>332</itunes:episode>
      <podcast:episode>332</podcast:episode>
      <itunes:title>Graph Generative Pre-trained Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">58a4a635-25f3-48ec-87a4-8c2f84062f10</guid>
      <link>https://share.transistor.fm/s/d68da6ce</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu</p>

            <p><strong>Title:</strong><br>
            Graph Generative Pre-trained Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01073v1">http://arxiv.org/abs/2501.01073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.</p>
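
            <p><strong>Illustrative sketch (not from the paper):</strong><br>
            The abstract represents a graph as a sequence of its node set followed by its edge set, so a transformer can be trained with ordinary next-token prediction. A toy serializer in that spirit is shown below; the special tokens and ordering are assumptions.</p>

            <pre><code>def graph_to_sequence(node_labels, edges):
    """node_labels: list like ['C', 'O']; edges: list of (src, dst, bond) tuples."""
    tokens = ["[BOS]", "[NODES]"]
    for idx, label in enumerate(node_labels):
        tokens += [f"n{idx}", label]
    tokens.append("[EDGES]")
    for src, dst, bond in edges:
        tokens += [f"n{src}", f"n{dst}", bond]
    tokens.append("[EOS]")
    return tokens

# e.g. carbon double-bonded to oxygen:
print(graph_to_sequence(["C", "O"], [(0, 1, "double")]))
</code></pre>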
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu</p>

            <p><strong>Title:</strong><br>
            Graph Generative Pre-trained Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01073v1">http://arxiv.org/abs/2501.01073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:13:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d68da6ce/8b06018a.mp3" length="19643724" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1224</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu</p>

            <p><strong>Title:</strong><br>
            Graph Generative Pre-trained Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01073v1">http://arxiv.org/abs/2501.01073v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models</title>
      <itunes:episode>331</itunes:episode>
      <podcast:episode>331</podcast:episode>
      <itunes:title>LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">52e73c61-19b3-4ac4-a5bc-d116837265d0</guid>
      <link>https://share.transistor.fm/s/6135a689</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen</p>

            <p><strong>Title:</strong><br>
            LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00874v1">http://arxiv.org/abs/2501.00874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen</p>

            <p><strong>Title:</strong><br>
            LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00874v1">http://arxiv.org/abs/2501.00874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:13:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6135a689/1ba4c132.mp3" length="22357184" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1394</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen</p>

            <p><strong>Title:</strong><br>
            LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00874v1">http://arxiv.org/abs/2501.00874v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery</title>
      <itunes:episode>330</itunes:episode>
      <podcast:episode>330</podcast:episode>
      <itunes:title>BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">12e7ff7d-aebc-404b-8c4a-7e69b158aa8f</guid>
      <link>https://share.transistor.fm/s/98445a58</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman</p>

            <p><strong>Title:</strong><br>
            BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01540v1">http://arxiv.org/abs/2501.01540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain its model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman</p>

            <p><strong>Title:</strong><br>
            BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01540v1">http://arxiv.org/abs/2501.01540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain its model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 06 Jan 2025 20:13:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/98445a58/a5db9ac5.mp3" length="24953522" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1556</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman</p>

            <p><strong>Title:</strong><br>
            BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01540v1">http://arxiv.org/abs/2501.01540v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise them in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain its model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining</title>
      <itunes:episode>329</itunes:episode>
      <podcast:episode>329</podcast:episode>
      <itunes:title>2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">882af077-7a37-4ec6-8cee-e821e0c160da</guid>
      <link>https://share.transistor.fm/s/e01876c2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00958v1">http://arxiv.org/abs/2501.00958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, like humans. However, such existing datasets are crawled from webpages and face challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00958v1">http://arxiv.org/abs/2501.00958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, like humans. However, such existing datasets are crawled from webpages and face challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:25:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e01876c2/cc8e1895.mp3" length="22982415" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 45 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00958v1">http://arxiv.org/abs/2501.00958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, like humans. However, such existing datasets are crawled from webpages and face challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings</title>
      <itunes:episode>328</itunes:episode>
      <podcast:episode>328</podcast:episode>
      <itunes:title>CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a22a5582-f276-4953-a88b-5167f5b3d216</guid>
      <link>https://share.transistor.fm/s/969e3163</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01257v1">http://arxiv.org/abs/2501.01257v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. The CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01257v1">http://arxiv.org/abs/2501.01257v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. The CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:24:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/969e3163/73bbbde2.mp3" length="22654341" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1412</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01257v1">http://arxiv.org/abs/2501.01257v1</a></p>

            <p><strong>Abstract:</strong><br>
            With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. The CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control</title>
      <itunes:episode>327</itunes:episode>
      <podcast:episode>327</podcast:episode>
      <itunes:title>VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a244aac9-09e2-4fe4-af95-3aac77107e65</guid>
      <link>https://share.transistor.fm/s/21216081</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01427v1">http://arxiv.org/abs/2501.01427v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve detailed appearance while supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweighted reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01427v1">http://arxiv.org/abs/2501.01427v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve detailed appearance while supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweighted reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:24:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/21216081/f4f8d12d.mp3" length="18539514" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1155</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01427v1">http://arxiv.org/abs/2501.01427v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve detailed appearance while supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweighted reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models</title>
      <itunes:episode>326</itunes:episode>
      <podcast:episode>326</podcast:episode>
      <itunes:title>Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0e1a502a-8c5d-4ef5-bb14-21e1f0d11f3a</guid>
      <link>https://share.transistor.fm/s/3582f7fa</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01423v1">http://arxiv.org/abs/2501.01423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35, while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs, representing an over 21-times convergence speedup compared to the original DiT. Models and code are available at: https://github.com/hustvl/LightningDiT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01423v1">http://arxiv.org/abs/2501.01423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35, while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs, representing an over 21-times convergence speedup compared to the original DiT. Models and code are available at: https://github.com/hustvl/LightningDiT.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:23:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3582f7fa/9620c8f3.mp3" length="23890652" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1489</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jingfeng Yao, Xinggang Wang</p>

            <p><strong>Title:</strong><br>
            Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01423v1">http://arxiv.org/abs/2501.01423v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35, while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs, representing an over 21-times convergence speedup compared to the original DiT. Models and code are available at: https://github.com/hustvl/LightningDiT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ProgCo: Program Helps Self-Correction of Large Language Models</title>
      <itunes:episode>325</itunes:episode>
      <podcast:episode>325</podcast:episode>
      <itunes:title>ProgCo: Program Helps Self-Correction of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6f719a03-5dc0-43de-8b8c-eb514e19a112</guid>
      <link>https://share.transistor.fm/s/48f3fb34</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            ProgCo: Program Helps Self-Correction of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01264v1">http://arxiv.org/abs/2501.01264v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, which further misleads refinement and leads to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effects of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction and can further enhance performance when combined with real program tools.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            ProgCo: Program Helps Self-Correction of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01264v1">http://arxiv.org/abs/2501.01264v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, which further misleads refinement and leads to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effects of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction and can further enhance performance when combined with real program tools.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:23:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/48f3fb34/55ae9e5f.mp3" length="19569767" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1219</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            ProgCo: Program Helps Self-Correction of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01264v1">http://arxiv.org/abs/2501.01264v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, which further misleads refinement and leads to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effects of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction and can further enhance performance when combined with real program tools.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models</title>
      <itunes:episode>324</itunes:episode>
      <podcast:episode>324</podcast:episode>
      <itunes:title>MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">34480ad8-ea15-4a81-8932-db7acf0ec299</guid>
      <link>https://share.transistor.fm/s/294d08f2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00316v1">http://arxiv.org/abs/2501.00316v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location- or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more pronounced when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00316v1">http://arxiv.org/abs/2501.00316v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location- or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more pronounced when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:23:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/294d08f2/81997432.mp3" length="24570246" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1532</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez</p>

            <p><strong>Title:</strong><br>
            MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00316v1">http://arxiv.org/abs/2501.00316v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in foundation models have enhanced AI systems' capabilities in autonomous tool usage and reasoning. However, their ability in location- or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 700 unique multiple-choice questions about locations across 180 cities and 54 countries, MapEval evaluates foundation models' ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 28 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 16% and 21%, respectively, and the gaps became even more pronounced when compared to open-source LLMs. Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A3: Android Agent Arena for Mobile GUI Agents</title>
      <itunes:episode>323</itunes:episode>
      <podcast:episode>323</podcast:episode>
      <itunes:title>A3: Android Agent Arena for Mobile GUI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8c3cf5bd-6a3f-4db5-8601-218178cc5000</guid>
      <link>https://share.transistor.fm/s/5ef90cb1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            A3: Android Agent Arena for Mobile GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01149v1">http://arxiv.org/abs/2501.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, together with a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at <a href="https://yuxiangchai.github.io/Android-Agent-Arena/">https://yuxiangchai.github.io/Android-Agent-Arena/</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            A3: Android Agent Arena for Mobile GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01149v1">http://arxiv.org/abs/2501.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, together with a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at <a href="https://yuxiangchai.github.io/Android-Agent-Arena/">https://yuxiangchai.github.io/Android-Agent-Arena/</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:22:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5ef90cb1/0a270e54.mp3" length="22693995" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            A3: Android Agent Arena for Mobile GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01149v1">http://arxiv.org/abs/2501.01149v1</a></p>

            <p><strong>Abstract:</strong><br>
            AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, together with a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at <a href="https://yuxiangchai.github.io/Android-Agent-Arena/">https://yuxiangchai.github.io/Android-Agent-Arena/</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MLLM-as-a-Judge for Image Safety without Human Labeling</title>
      <itunes:episode>322</itunes:episode>
      <podcast:episode>322</podcast:episode>
      <itunes:title>MLLM-as-a-Judge for Image Safety without Human Labeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1593a110-9cf8-4c04-8b93-e741884df2d5</guid>
      <link>https://share.transistor.fm/s/5b49ef3e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain</p>

            <p><strong>Title:</strong><br>
            MLLM-as-a-Judge for Image Safety without Human Labeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00192v1">http://arxiv.org/abs/2501.00192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain</p>

            <p><strong>Title:</strong><br>
            MLLM-as-a-Judge for Image Safety without Human Labeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00192v1">http://arxiv.org/abs/2501.00192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:22:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5b49ef3e/dab85455.mp3" length="21497806" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV, cs.CL, cs.CY, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain</p>

            <p><strong>Title:</strong><br>
            MLLM-as-a-Judge for Image Safety without Human Labeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.00192v1">http://arxiv.org/abs/2501.00192v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Dynamic Scaling of Unit Tests for Code Reward Modeling</title>
      <itunes:episode>321</itunes:episode>
      <podcast:episode>321</podcast:episode>
      <itunes:title>Dynamic Scaling of Unit Tests for Code Reward Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">293e7704-7da5-4c25-9fc4-7eabe66711b1</guid>
      <link>https://share.transistor.fm/s/97f51e4c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Dynamic Scaling of Unit Tests for Code Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01054v1">http://arxiv.org/abs/2501.01054v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. Because LLMs often make mistakes with high confidence, these unit tests are not reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pilot experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Dynamic Scaling of Unit Tests for Code Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01054v1">http://arxiv.org/abs/2501.01054v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. Because LLMs often make mistakes with high confidence, these unit tests are not reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pilot experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 03 Jan 2025 20:22:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/97f51e4c/e7278e14.mp3" length="21049334" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang</p>

            <p><strong>Title:</strong><br>
            Dynamic Scaling of Unit Tests for Code Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2501.01054v1">http://arxiv.org/abs/2501.01054v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. Because LLMs often make mistakes with high confidence, these unit tests are not reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pilot experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis</title>
      <itunes:episode>320</itunes:episode>
      <podcast:episode>320</podcast:episode>
      <itunes:title>OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9f75efd-2359-41cd-8e73-c774860deea1</guid>
      <link>https://share.transistor.fm/s/83a864bb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19723v1">http://arxiv.org/abs/2412.19723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our code, data, and checkpoints are available at the <a href="https://qiushisun.github.io/OS-Genesis-Home/">OS-Genesis Homepage</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19723v1">http://arxiv.org/abs/2412.19723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our code, data, and checkpoints are available at the <a href="https://qiushisun.github.io/OS-Genesis-Home/">OS-Genesis Homepage</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Jan 2025 20:07:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83a864bb/124044a2.mp3" length="21786643" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 52 | cs.AI, cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu</p>

            <p><strong>Title:</strong><br>
            OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19723v1">http://arxiv.org/abs/2412.19723v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis's efficiency and its superior data quality and diversity compared to existing synthesis methods. Our code, data, and checkpoints are available at the <a href="https://qiushisun.github.io/OS-Genesis-Home/">OS-Genesis Homepage</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Xmodel-2 Technical Report</title>
      <itunes:episode>319</itunes:episode>
      <podcast:episode>319</podcast:episode>
      <itunes:title>Xmodel-2 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bf7d1348-a05d-4d36-a158-23e552289406</guid>
      <link>https://share.transistor.fm/s/44395460</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Qu Zhijiu, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19638v1">http://arxiv.org/abs/2412.19638v1</a></p>

            <p><strong>Abstract:</strong><br>
            Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize training efficiency and stability, Xmodel-2 employs the WSD learning rate scheduler from MiniCPM. Pretrained on 1.5 trillion tokens from diverse sources, Xmodel-2 achieves state-of-the-art performance in complex reasoning and agent-based tasks, while maintaining low training costs. These results highlight the potential of efficient model design and training strategies in advancing reasoning capabilities. Model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/Xmodel-2</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Qu Zhijiu, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19638v1">http://arxiv.org/abs/2412.19638v1</a></p>

            <p><strong>Abstract:</strong><br>
            Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize training efficiency and stability, Xmodel-2 employs the WSD learning rate scheduler from MiniCPM. Pretrained on 1.5 trillion tokens from diverse sources, Xmodel-2 achieves state-of-the-art performance in complex reasoning and agent-based tasks, while maintaining low training costs. These results highlight the potential of efficient model design and training strategies in advancing reasoning capabilities. Model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/Xmodel-2</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Jan 2025 20:06:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/44395460/94acbbd2.mp3" length="16639418" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1036</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Qu Zhijiu, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-2 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19638v1">http://arxiv.org/abs/2412.19638v1</a></p>

            <p><strong>Abstract:</strong><br>
            Xmodel-2 is a 1.2-billion-parameter large language model designed specifically for reasoning tasks. Its architecture enables different model scales to share a unified set of hyperparameters, allowing for extensive experimentation on smaller models and seamless transfer of optimal configurations to larger models. To maximize training efficiency and stability, Xmodel-2 employs the WSD learning rate scheduler from MiniCPM. Pretrained on 1.5 trillion tokens from diverse sources, Xmodel-2 achieves state-of-the-art performance in complex reasoning and agent-based tasks, while maintaining low training costs. These results highlight the potential of efficient model design and training strategies in advancing reasoning capabilities. Model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/Xmodel-2</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are Vision-Language Models Truly Understanding Multi-vision Sensor?</title>
      <itunes:episode>318</itunes:episode>
      <podcast:episode>318</podcast:episode>
      <itunes:title>Are Vision-Language Models Truly Understanding Multi-vision Sensor?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a96f8957-e39e-4b42-86a1-0cdebe44fd0d</guid>
      <link>https://share.transistor.fm/s/26b3416e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro</p>

            <p><strong>Title:</strong><br>
            Are Vision-Language Models Truly Understanding Multi-vision Sensor?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20750v1">http://arxiv.org/abs/2412.20750v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without deep understanding of sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method can significantly improve the multi-vision sensor reasoning for VLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro</p>

            <p><strong>Title:</strong><br>
            Are Vision-Language Models Truly Understanding Multi-vision Sensor?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20750v1">http://arxiv.org/abs/2412.20750v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without deep understanding of sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method can significantly improve the multi-vision sensor reasoning for VLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Jan 2025 20:06:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26b3416e/700418fa.mp3" length="23898575" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1490</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro</p>

            <p><strong>Title:</strong><br>
            Are Vision-Language Models Truly Understanding Multi-vision Sensor?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20750v1">http://arxiv.org/abs/2412.20750v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale Vision-Language Models (VLMs) have advanced by aligning vision inputs with text, significantly improving performance in computer vision tasks. Moreover, for VLMs to be effectively utilized in real-world applications, an understanding of diverse multi-vision sensor data, such as thermal, depth, and X-ray information, is essential. However, we find that current VLMs process multi-vision sensor images without deep understanding of sensor information, disregarding each sensor's unique physical properties. This limitation restricts their capacity to interpret and respond to complex questions requiring multi-vision sensor reasoning. To address this, we propose a novel Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning on multi-vision sensor tasks, helping to bridge the core information gap between images and sensor data. Extensive experimental results validate that the proposed DNA method can significantly improve the multi-vision sensor reasoning for VLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving</title>
      <itunes:episode>317</itunes:episode>
      <podcast:episode>317</podcast:episode>
      <itunes:title>HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c28d387f-db32-4587-a496-a4688a45fee8</guid>
      <link>https://share.transistor.fm/s/d60132d6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, Haitao Mi</p>

            <p><strong>Title:</strong><br>
            HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20735v2">http://arxiv.org/abs/2412.20735v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HunyuanProver, a language model finetuned from Hunyuan 7B for interactive automatic theorem proving with LEAN4. To alleviate the data sparsity issue, we design a scalable framework to iteratively synthesize data at low cost. In addition, guided tree search algorithms are designed to enable effective "system 2 thinking" by the prover. HunyuanProver achieves state-of-the-art (SOTA) performance on major benchmarks. Specifically, it achieves a pass rate of 68.4% on the miniF2F-test, compared with the previous SOTA result of 65.9%. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2, imo_1964_p2, and imo_1983_p6) in the miniF2F-test. To benefit the community, we will open-source a dataset of 30k synthesized instances, where each instance contains the original question in natural language, the converted statement by autoformalization, and the proof by HunyuanProver.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, Haitao Mi</p>

            <p><strong>Title:</strong><br>
            HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20735v2">http://arxiv.org/abs/2412.20735v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HunyuanProver, a language model finetuned from Hunyuan 7B for interactive automatic theorem proving with LEAN4. To alleviate the data sparsity issue, we design a scalable framework to iteratively synthesize data at low cost. In addition, guided tree search algorithms are designed to enable effective "system 2 thinking" by the prover. HunyuanProver achieves state-of-the-art (SOTA) performance on major benchmarks. Specifically, it achieves a pass rate of 68.4% on the miniF2F-test, compared with the previous SOTA result of 65.9%. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2, imo_1964_p2, and imo_1983_p6) in the miniF2F-test. To benefit the community, we will open-source a dataset of 30k synthesized instances, where each instance contains the original question in natural language, the converted statement by autoformalization, and the proof by HunyuanProver.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Jan 2025 20:06:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d60132d6/1fbdf311.mp3" length="20031653" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, Haitao Mi</p>

            <p><strong>Title:</strong><br>
            HUNYUANPROVER: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20735v2">http://arxiv.org/abs/2412.20735v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce HunyuanProver, a language model finetuned from Hunyuan 7B for interactive automatic theorem proving with LEAN4. To alleviate the data sparsity issue, we design a scalable framework to iteratively synthesize data at low cost. In addition, guided tree search algorithms are designed to enable effective "system 2 thinking" by the prover. HunyuanProver achieves state-of-the-art (SOTA) performance on major benchmarks. Specifically, it achieves a pass rate of 68.4% on the miniF2F-test, compared with the previous SOTA result of 65.9%. It proves 4 IMO statements (imo_1960_p2, imo_1962_p2, imo_1964_p2, and imo_1983_p6) in the miniF2F-test. To benefit the community, we will open-source a dataset of 30k synthesized instances, where each instance contains the original question in natural language, the converted statement by autoformalization, and the proof by HunyuanProver.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control</title>
      <itunes:episode>316</itunes:episode>
      <podcast:episode>316</podcast:episode>
      <itunes:title>VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5671b693-595c-46a9-8fa1-7934c724dc49</guid>
      <link>https://share.transistor.fm/s/16583125</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He</p>

            <p><strong>Title:</strong><br>
            VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20800v1">http://arxiv.org/abs/2412.20800v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He</p>

            <p><strong>Title:</strong><br>
            VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20800v1">http://arxiv.org/abs/2412.20800v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 02 Jan 2025 20:05:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16583125/96b27fa8.mp3" length="21275477" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1326</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He</p>

            <p><strong>Title:</strong><br>
            VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20800v1">http://arxiv.org/abs/2412.20800v1</a></p>

            <p><strong>Abstract:</strong><br>
            While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs</title>
      <itunes:episode>315</itunes:episode>
      <podcast:episode>315</podcast:episode>
      <itunes:title>Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3581bbd4-a903-4370-a53d-0ee91134fa1b</guid>
      <link>https://share.transistor.fm/s/aad91b54</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21187v1">http://arxiv.org/abs/2412.21187v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21187v1">http://arxiv.org/abs/2412.21187v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Jan 2025 19:04:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aad91b54/247c5112.mp3" length="19378348" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1207</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu</p>

            <p><strong>Title:</strong><br>
            Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21187v1">http://arxiv.org/abs/2412.21187v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable performance of models like OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated to simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of test sets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System</title>
      <itunes:episode>314</itunes:episode>
      <podcast:episode>314</podcast:episode>
      <itunes:title>OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">25783f79-0e4f-40a1-9855-77f6d6c29bb1</guid>
      <link>https://share.transistor.fm/s/8b161ece</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.DB, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20005v1">http://arxiv.org/abs/2412.20005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configurable knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configurable knowledge base facilitates schema configuration, error-case debugging, and correction, further improving performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the code at https://github.com/zjunlp/OneKE and released a video at http://oneke.openkg.cn/demo.mp4.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.DB, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20005v1">http://arxiv.org/abs/2412.20005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configurable knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configurable knowledge base facilitates schema configuration, error-case debugging, and correction, further improving performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the code at https://github.com/zjunlp/OneKE and released a video at http://oneke.openkg.cn/demo.mp4.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 01 Jan 2025 19:04:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8b161ece/96af6c61.mp3" length="18185083" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1133</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.DB, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yujie Luo, Xiangyuan Ru, Kangwei Liu, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Lanning Wei, Da Zheng, Haofen Wang, Huajun Chen</p>

            <p><strong>Title:</strong><br>
            OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20005v1">http://arxiv.org/abs/2412.20005v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configurable knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configurable knowledge base facilitates schema configuration, error-case debugging, and correction, further improving performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the code at https://github.com/zjunlp/OneKE and released a video at http://oneke.openkg.cn/demo.mp4.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization</title>
      <itunes:episode>313</itunes:episode>
      <podcast:episode>313</podcast:episode>
      <itunes:title>Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03da2083-7842-4b43-83a5-8a0a60e24543</guid>
      <link>https://share.transistor.fm/s/80fdeabe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding</p>

            <p><strong>Title:</strong><br>
            Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18525v2">http://arxiv.org/abs/2412.18525v2</a></p>

            <p><strong>Abstract:</strong><br>
            Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks (due to these terminological definitions), deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding</p>

            <p><strong>Title:</strong><br>
            Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18525v2">http://arxiv.org/abs/2412.18525v2</a></p>

            <p><strong>Abstract:</strong><br>
            Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks (due to these terminological definitions), deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:27:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/80fdeabe/c16c388b.mp3" length="24118870" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, Errui Ding</p>

            <p><strong>Title:</strong><br>
            Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18525v2">http://arxiv.org/abs/2412.18525v2</a></p>

            <p><strong>Abstract:</strong><br>
            Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks (due to these terminological definitions), deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>On the Compositional Generalization of Multimodal LLMs for Medical Imaging</title>
      <itunes:episode>312</itunes:episode>
      <podcast:episode>312</podcast:episode>
      <itunes:title>On the Compositional Generalization of Multimodal LLMs for Medical Imaging</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27a71be5-1174-4f03-95b7-2835e7f44d64</guid>
      <link>https://share.transistor.fm/s/a34d608f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            On the Compositional Generalization of Multimodal LLMs for Medical Imaging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20070v1">http://arxiv.org/abs/2412.20070v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand what kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training, as different tasks can benefit each other, but such studies often overlook the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Since medical images can be precisely defined by Modality, Anatomical area, and Task, they naturally provide an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at https://github.com/FreedomIntelligence/Med-MAT.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            On the Compositional Generalization of Multimodal LLMs for Medical Imaging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20070v1">http://arxiv.org/abs/2412.20070v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand what kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training, as different tasks can benefit each other, but such studies often overlook the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Since medical images can be precisely defined by Modality, Anatomical area, and Task, they naturally provide an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at https://github.com/FreedomIntelligence/Med-MAT.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:26:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a34d608f/437d56a8.mp3" length="21900319" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1365</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            On the Compositional Generalization of Multimodal LLMs for Medical Imaging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20070v1">http://arxiv.org/abs/2412.20070v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need to understand what kinds of images MLLMs can use for generalization. Current research suggests that multi-task training outperforms single-task training, as different tasks can benefit each other, but such studies often overlook the internal relationships within these tasks, providing limited guidance on selecting datasets to enhance specific tasks. To analyze this phenomenon, we attempted to employ compositional generalization (CG), the ability of models to understand novel combinations by recombining learned elements, as a guiding framework. Since medical images can be precisely defined by Modality, Anatomical area, and Task, they naturally provide an environment for exploring CG. We therefore assembled 106 medical datasets to create Med-MAT for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and delivers consistent performance across different backbones, highlighting its versatility and broad applicability. Med-MAT is publicly available at https://github.com/FreedomIntelligence/Med-MAT.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bringing Objects to Life: 4D generation from 3D objects</title>
      <itunes:episode>311</itunes:episode>
      <podcast:episode>311</podcast:episode>
      <itunes:title>Bringing Objects to Life: 4D generation from 3D objects</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0af64e29-f207-4e3b-9c68-ce445e3c0f14</guid>
      <link>https://share.transistor.fm/s/662818be</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Bringing Objects to Life: 4D generation from 3D objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20422v1">http://arxiv.org/abs/2412.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Bringing Objects to Life: 4D generation from 3D objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20422v1">http://arxiv.org/abs/2412.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:26:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/662818be/a890c718.mp3" length="20983716" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ohad Rahamim, Ori Malca, Dvir Samuel, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Bringing Objects to Life: 4D generation from 3D objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20422v1">http://arxiv.org/abs/2412.20422v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in generative modeling now enable the creation of 4D content (moving 3D objects) controlled with text prompts. 4D generation has large potential in applications like virtual worlds, media, and gaming, but existing methods provide limited control over the appearance and geometry of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the identity of the original object. We first convert a 3D mesh into a "static" 4D Neural Radiance Field (NeRF) that preserves the visual attributes of the input object. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model in terms of temporal coherence, prompt adherence, and visual fidelity and find that our method outperforms baselines that are based on other approaches, achieving up to threefold improvements in identity preservation measured using LPIPS scores, and effectively balancing visual quality with dynamic content.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficiently Serving LLM Reasoning Programs with Certaindex</title>
      <itunes:episode>310</itunes:episode>
      <podcast:episode>310</podcast:episode>
      <itunes:title>Efficiently Serving LLM Reasoning Programs with Certaindex</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bdf27ce4-a2c0-4c70-ab63-40b88c3e9e49</guid>
      <link>https://share.transistor.fm/s/bbbbc6df</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficiently Serving LLM Reasoning Programs with Certaindex</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20993v1">http://arxiv.org/abs/2412.20993v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficiently Serving LLM Reasoning Programs with Certaindex</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20993v1">http://arxiv.org/abs/2412.20993v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:25:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bbbbc6df/919d639f.mp3" length="19570599" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1219</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang</p>

            <p><strong>Title:</strong><br>
            Efficiently Serving LLM Reasoning Programs with Certaindex</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20993v1">http://arxiv.org/abs/2412.20993v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of large language models (LLMs) has unlocked their capabilities in advanced reasoning tasks like mathematical problem-solving, code generation, and legal analysis. Central to this progress are inference-time reasoning algorithms, which refine outputs by exploring multiple solution paths, at the cost of increasing compute demands and response latencies. Existing serving systems fail to adapt to the scaling behaviors of these algorithms or the varying difficulty of queries, leading to inefficient resource use and unmet latency targets. We present Dynasor, a system that optimizes inference-time compute for LLM reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries and uses Certaindex, a proxy that measures statistical reasoning progress based on model certainty, to guide compute allocation dynamically. Dynasor co-adapts scheduling with reasoning progress: it allocates more compute to hard queries, reduces compute for simpler ones, and terminates unpromising queries early, balancing accuracy, latency, and cost. On diverse datasets and algorithms, Dynasor reduces compute by up to 50% in batch processing and sustains 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization</title>
      <itunes:episode>309</itunes:episode>
      <podcast:episode>309</podcast:episode>
      <itunes:title>TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">91cf9b8d-9872-4b55-8a73-689d6e6927b8</guid>
      <link>https://share.transistor.fm/s/06c3632c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21037v1">http://arxiv.org/abs/2412.21037v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21037v1">http://arxiv.org/abs/2412.21037v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:25:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/06c3632c/579d2951.mp3" length="20460494" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.SD, cs.AI, cs.CL, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Rafael Valle, Bryan Catanzaro, Soujanya Poria</p>

            <p><strong>Title:</strong><br>
            TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21037v1">http://arxiv.org/abs/2412.21037v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in just 3.7 seconds on a single A40 GPU. A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We demonstrate that the audio preference dataset generated using CRPO outperforms existing alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. We open source all code and models to support further research in TTA generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Edicho: Consistent Image Editing in the Wild</title>
      <itunes:episode>308</itunes:episode>
      <podcast:episode>308</podcast:episode>
      <itunes:title>Edicho: Consistent Image Editing in the Wild</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7512dfd6-9f08-41ea-a282-f58c68944f49</guid>
      <link>https://share.transistor.fm/s/fe68b342</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Edicho: Consistent Image Editing in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21079v1">http://arxiv.org/abs/2412.21079v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Edicho: Consistent Image Editing in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21079v1">http://arxiv.org/abs/2412.21079v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:25:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fe68b342/745d86c8.mp3" length="21933308" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1367</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Edicho: Consistent Image Editing in the Wild</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21079v1">http://arxiv.org/abs/2412.21079v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible with most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Facilitating large language model Russian adaptation with Learned Embedding Propagation</title>
      <itunes:episode>307</itunes:episode>
      <podcast:episode>307</podcast:episode>
      <itunes:title>Facilitating large language model Russian adaptation with Learned Embedding Propagation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f57beeb-096b-4178-8a69-1b29e82f2bd4</guid>
      <link>https://share.transistor.fm/s/65c30522</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Tikhomirov, Daniil Chernyshev</p>

            <p><strong>Title:</strong><br>
            Facilitating large language model Russian adaptation with Learned Embedding Propagation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21140v1">http://arxiv.org/abs/2412.21140v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rapid advancements in large language model (LLM) technologies have led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of such models do not disclose the training data necessary for replicating the results, thus making the achievements model-exclusive. Since those open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options such as vocabulary extension and subsequent continued pre-training are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM task-solving capabilities. To address these limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements due to its minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implanting the new language knowledge directly into any existing instruction-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Tikhomirov, Daniil Chernyshev</p>

            <p><strong>Title:</strong><br>
            Facilitating large language model Russian adaptation with Learned Embedding Propagation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21140v1">http://arxiv.org/abs/2412.21140v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rapid advancements in large language model (LLM) technologies have led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of such models do not disclose the training data necessary for replicating the results, thus making the achievements model-exclusive. Since those open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options such as vocabulary extension and subsequent continued pre-training are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM task-solving capabilities. To address these limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements due to its minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implanting the new language knowledge directly into any existing instruction-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:24:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/65c30522/d595cbb8.mp3" length="21376630" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1332</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Mikhail Tikhomirov, Daniil Chernyshev</p>

            <p><strong>Title:</strong><br>
            Facilitating large language model Russian adaptation with Learned Embedding Propagation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21140v1">http://arxiv.org/abs/2412.21140v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rapid advancements in large language model (LLM) technologies have led to the introduction of powerful open-source instruction-tuned LLMs that have the same text generation quality as state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, the authors of such models do not disclose the training data necessary for replicating the results, thus making the achievements model-exclusive. Since those open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference computation efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options such as vocabulary extension and subsequent continued pre-training are also inhibited by the lack of access to high-quality instruction-tuning data, since it is the major factor behind the resulting LLM task-solving capabilities. To address these limitations and cut the costs of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements due to its minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implanting the new language knowledge directly into any existing instruction-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Training Software Engineering Agents and Verifiers with SWE-Gym</title>
      <itunes:episode>306</itunes:episode>
      <podcast:episode>306</podcast:episode>
      <itunes:title>Training Software Engineering Agents and Verifiers with SWE-Gym</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5ce2427e-1593-4c0f-bce7-017a0f0996f6</guid>
      <link>https://share.transistor.fm/s/69b74b38</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang</p>

            <p><strong>Title:</strong><br>
            Training Software Engineering Agents and Verifiers with SWE-Gym</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21139v1">http://arxiv.org/abs/2412.21139v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang</p>

            <p><strong>Title:</strong><br>
            Training Software Engineering Agents and Verifiers with SWE-Gym</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21139v1">http://arxiv.org/abs/2412.21139v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:24:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69b74b38/a1c27701.mp3" length="25887221" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1614</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang</p>

            <p><strong>Title:</strong><br>
            Training Software Engineering Agents and Verifiers with SWE-Gym</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21139v1">http://arxiv.org/abs/2412.21139v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation</title>
      <itunes:episode>305</itunes:episode>
      <podcast:episode>305</podcast:episode>
      <itunes:title>HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">964e32f9-53a5-4d36-9317-0907b7c46703</guid>
      <link>https://share.transistor.fm/s/1169e922</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang</p>

            <p><strong>Title:</strong><br>
            HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21199v1">http://arxiv.org/abs/2412.21199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang</p>

            <p><strong>Title:</strong><br>
            HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21199v1">http://arxiv.org/abs/2412.21199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:23:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1169e922/181f5b3e.mp3" length="20119414" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.SE, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang</p>

            <p><strong>Title:</strong><br>
            HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.21199v1">http://arxiv.org/abs/2412.21199v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Slow Perception: Let's Perceive Geometric Figures Step-by-step</title>
      <itunes:episode>304</itunes:episode>
      <podcast:episode>304</podcast:episode>
      <itunes:title>Slow Perception: Let's Perceive Geometric Figures Step-by-step</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05889bc8-39cc-4d0c-87e2-b55da888da34</guid>
      <link>https://share.transistor.fm/s/3635c49a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang</p>

            <p><strong>Title:</strong><br>
            Slow Perception: Let's Perceive Geometric Figures Step-by-step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20631v1">http://arxiv.org/abs/2412.20631v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations and, as we humans do, reconstruct complex geometric structures progressively. There are two stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference-time scaling law -- the slower, the better. Researchers have strived to speed up the model's perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang</p>

            <p><strong>Title:</strong><br>
            Slow Perception: Let's Perceive Geometric Figures Step-by-step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20631v1">http://arxiv.org/abs/2412.20631v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations and, as we humans do, reconstruct complex geometric structures progressively. There are two stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference-time scaling law -- the slower, the better. Researchers have strived to speed up the model's perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 31 Dec 2024 20:23:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3635c49a/10c79b44.mp3" length="22439057" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang</p>

            <p><strong>Title:</strong><br>
            Slow Perception: Let's Perceive Geometric Figures Step-by-step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.20631v1">http://arxiv.org/abs/2412.20631v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, "visual o1" began to enter people's vision, with expectations that this slow-thinking design can solve visual reasoning tasks, especially geometric math problems. However, the reality is that current LVLMs (Large Vision Language Models) can hardly even accurately copy a geometric figure, let alone truly understand the complex inherent logic and spatial relationships within geometric shapes. We believe accurate copying (strong perception) is the first step to visual o1. Accordingly, we introduce the concept of "slow perception" (SP), which guides the model to gradually perceive basic point-line combinations and, as we humans do, reconstruct complex geometric structures progressively. There are two stages in SP: a) perception decomposition. Perception is not instantaneous. In this stage, complex geometric figures are broken down into basic simple units to unify geometry representation. b) perception flow, which acknowledges that accurately tracing a line is not an easy task. This stage aims to avoid "long visual jumps" in regressing line segments by using a proposed "perceptual ruler" to trace each line stroke-by-stroke. Surprisingly, such a human-like perception manner enjoys an inference-time scaling law -- the slower, the better. Researchers have strived to speed up the model's perception in the past, but we slow it down again, allowing the model to read the image step-by-step and carefully.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs</title>
      <itunes:episode>303</itunes:episode>
      <podcast:episode>303</podcast:episode>
      <itunes:title>HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a865c6b4-9e26-485d-a6b6-aab6a0ffd928</guid>
      <link>https://share.transistor.fm/s/967220e7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18925v1">http://arxiv.org/abs/2412.18925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike that in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18925v1">http://arxiv.org/abs/2412.18925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike that in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:35:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/967220e7/5c72ce64.mp3" length="22440724" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18925v1">http://arxiv.org/abs/2412.18925v1</a></p>

            <p><strong>Abstract:</strong><br>
            The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike that in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show that complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>1.58-bit FLUX</title>
      <itunes:episode>302</itunes:episode>
      <podcast:episode>302</podcast:episode>
      <itunes:title>1.58-bit FLUX</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4e64583b-3ade-4085-aad9-6525485a1517</guid>
      <link>https://share.transistor.fm/s/4a895514</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            1.58-bit FLUX</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18653v1">http://arxiv.org/abs/2412.18653v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            1.58-bit FLUX</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18653v1">http://arxiv.org/abs/2412.18653v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:35:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4a895514/87d153bf.mp3" length="22117179" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1379</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenglin Yang, Celong Liu, Xueqing Deng, Dongwon Kim, Xing Mei, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            1.58-bit FLUX</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18653v1">http://arxiv.org/abs/2412.18653v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey</title>
      <itunes:episode>301</itunes:episode>
      <podcast:episode>301</podcast:episode>
      <itunes:title>Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4bf4647-45f9-478a-8b26-824f3c2dc0ac</guid>
      <link>https://share.transistor.fm/s/b1bd16e3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MM, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18619v2">http://arxiv.org/abs/2412.18619v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets &amp; evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MM, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18619v2">http://arxiv.org/abs/2412.18619v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets &amp; evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:34:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b1bd16e3/a1558796.mp3" length="16866422" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1050</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL, cs.AI, cs.CV, cs.LG, cs.MM, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang</p>

            <p><strong>Title:</strong><br>
            Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18619v2">http://arxiv.org/abs/2412.18619v2</a></p>

            <p><strong>Abstract:</strong><br>
            Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success. As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context. This survey introduces a comprehensive taxonomy that unifies both understanding and generation within multimodal learning through the lens of NTP. The proposed taxonomy covers five key aspects: multimodal tokenization, MMNTP model architectures, unified task representation, datasets &amp; evaluation, and open challenges. This new taxonomy aims to aid researchers in their exploration of multimodal intelligence. An associated GitHub repository collecting the latest papers and repos is available at https://github.com/LMM101/Awesome-Multimodal-Next-Token-Prediction</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models</title>
      <itunes:episode>300</itunes:episode>
      <podcast:episode>300</podcast:episode>
      <itunes:title>Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3e8089b9-3e56-45ff-9f65-7fcf5e47d62d</guid>
      <link>https://share.transistor.fm/s/cb43d8ec</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao</p>

            <p><strong>Title:</strong><br>
            Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18605v1">http://arxiv.org/abs/2412.18605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Orientation is a key attribute of objects, crucial for understanding their spatial pose and arrangement in images. However, practical solutions for accurate orientation estimation from a single image remain underexplored. In this work, we introduce Orient Anything, the first expert and foundational model designed to estimate object orientation in a single- and free-view image. Due to the scarcity of labeled data, we propose extracting knowledge from the 3D world. By developing a pipeline to annotate the front face of 3D objects and render images from random views, we collect 2M images with precise orientation annotations. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions of three angles and predicts the object orientation by fitting these distributions. Besides, we employ several strategies to improve synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images and exhibits impressive zero-shot ability in various scenarios. More importantly, our model enhances many applications, such as comprehension and generation of complex spatial concepts and 3D object pose adjustment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao</p>

            <p><strong>Title:</strong><br>
            Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18605v1">http://arxiv.org/abs/2412.18605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Orientation is a key attribute of objects, crucial for understanding their spatial pose and arrangement in images. However, practical solutions for accurate orientation estimation from a single image remain underexplored. In this work, we introduce Orient Anything, the first expert and foundational model designed to estimate object orientation in a single- and free-view image. Due to the scarcity of labeled data, we propose extracting knowledge from the 3D world. By developing a pipeline to annotate the front face of 3D objects and render images from random views, we collect 2M images with precise orientation annotations. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions of three angles and predicts the object orientation by fitting these distributions. Besides, we employ several strategies to improve synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images and exhibits impressive zero-shot ability in various scenarios. More importantly, our model enhances many applications, such as comprehension and generation of complex spatial concepts and 3D object pose adjustment.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:34:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cb43d8ec/9c04f6a4.mp3" length="22388509" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1396</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, Zhou Zhao</p>

            <p><strong>Title:</strong><br>
            Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18605v1">http://arxiv.org/abs/2412.18605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Orientation is a key attribute of objects, crucial for understanding their spatial pose and arrangement in images. However, practical solutions for accurate orientation estimation from a single image remain underexplored. In this work, we introduce Orient Anything, the first expert and foundational model designed to estimate object orientation in a single- and free-view image. Due to the scarcity of labeled data, we propose extracting knowledge from the 3D world. By developing a pipeline to annotate the front face of 3D objects and render images from random views, we collect 2M images with precise orientation annotations. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions of three angles and predicts the object orientation by fitting these distributions. Besides, we employ several strategies to improve synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images and exhibits impressive zero-shot ability in various scenarios. More importantly, our model enhances many applications, such as comprehension and generation of complex spatial concepts and 3D object pose adjustment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</title>
      <itunes:episode>299</itunes:episode>
      <podcast:episode>299</podcast:episode>
      <itunes:title>Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e3958781-79b8-439b-95a4-44e52c5ff6dd</guid>
      <link>https://share.transistor.fm/s/e3575308</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang</p>

            <p><strong>Title:</strong><br>
            Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19326v1">http://arxiv.org/abs/2412.19326v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang</p>

            <p><strong>Title:</strong><br>
            Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19326v1">http://arxiv.org/abs/2412.19326v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:33:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e3575308/b68f72da.mp3" length="24101735" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1503</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang</p>

            <p><strong>Title:</strong><br>
            Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19326v1">http://arxiv.org/abs/2412.19326v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training methodologies. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Elements to Design: A Layered Approach for Automatic Graphic Design Composition</title>
      <itunes:episode>298</itunes:episode>
      <podcast:episode>298</podcast:episode>
      <itunes:title>From Elements to Design: A Layered Approach for Automatic Graphic Design Composition</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ae58bf30-5593-4707-ae42-f1d8bb237301</guid>
      <link>https://share.transistor.fm/s/51a5ab4f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            From Elements to Design: A Layered Approach for Automatic Graphic Design Composition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19712v1">http://arxiv.org/abs/2412.19712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            From Elements to Design: A Layered Approach for Automatic Graphic Design Composition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19712v1">http://arxiv.org/abs/2412.19712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:33:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/51a5ab4f/d15437bc.mp3" length="21787062" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian</p>

            <p><strong>Title:</strong><br>
            From Elements to Design: A Layered Approach for Automatic Graphic Design Composition</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19712v1">http://arxiv.org/abs/2412.19712v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we investigate automatic design composition from multimodal graphic elements. Although recent studies have developed various generative models for graphic design, they usually face the following limitations: they only focus on certain subtasks and are far from achieving the design composition task; they do not consider the hierarchical information of graphic designs during the generation process. To tackle these issues, we introduce the layered design principle into Large Multimodal Models (LMMs) and propose a novel approach, called LaDeCo, to accomplish this challenging task. Specifically, LaDeCo first performs layer planning for a given element set, dividing the input elements into different semantic layers according to their contents. Based on the planning results, it subsequently predicts element attributes that control the design composition in a layer-wise manner, and includes the rendered image of previously generated layers into the context. With this insightful design, LaDeCo decomposes the difficult task into smaller manageable steps, making the generation process smoother and clearer. The experimental results demonstrate the effectiveness of LaDeCo in design composition. Furthermore, we show that LaDeCo enables some interesting applications in graphic design, such as resolution adjustment, element filling, design variation, etc. In addition, it even outperforms the specialized models in some design subtasks without any task-specific training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models</title>
      <itunes:episode>297</itunes:episode>
      <podcast:episode>297</podcast:episode>
      <itunes:title>VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5e580199-1ba9-4045-aa03-96e1ce2ab651</guid>
      <link>https://share.transistor.fm/s/2ca08056</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li</p>

            <p><strong>Title:</strong><br>
            VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19645v2">http://arxiv.org/abs/2412.19645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li</p>

            <p><strong>Title:</strong><br>
            VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19645v2">http://arxiv.org/abs/2412.19645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:33:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ca08056/ecf18e3f.mp3" length="22981605" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li</p>

            <p><strong>Title:</strong><br>
            VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19645v2">http://arxiv.org/abs/2412.19645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also significantly aligns with VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within VDM, ensuring that VDM has better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Superposition of Diffusion Models Using the Itô Density Estimator</title>
      <itunes:episode>296</itunes:episode>
      <podcast:episode>296</podcast:episode>
      <itunes:title>The Superposition of Diffusion Models Using the Itô Density Estimator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">589f7f66-126e-4f5d-ae91-a6a64cdc880f</guid>
      <link>https://share.transistor.fm/s/39b8b672</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Marta Skreta, Lazar Atanackovic, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov</p>

            <p><strong>Title:</strong><br>
            The Superposition of Diffusion Models Using the Itô Density Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17762v1">http://arxiv.org/abs/2412.17762v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. https://github.com/necludov/super-diffusion</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Marta Skreta, Lazar Atanackovic, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov</p>

            <p><strong>Title:</strong><br>
            The Superposition of Diffusion Models Using the Itô Density Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17762v1">http://arxiv.org/abs/2412.17762v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. https://github.com/necludov/super-diffusion</p>
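
            <p><strong>Illustrative sketch:</strong><br>
            A minimal, hypothetical PyTorch sketch of inference-time superposition: score estimates from several pre-trained diffusion models are combined through a weighted sum at each denoising step. The function name and the fixed mixing weights are assumptions for illustration; SuperDiff itself adapts the weights with its Itô density estimator rather than fixing them.</p>

            <pre><code>import torch

def superposed_score(models, x_t, t, log_weights):
    # Combine score/noise estimates from several pre-trained diffusion models.
    # `models` are callables returning an estimate for (x_t, t);
    # `log_weights` are per-model mixing logits (hypothetical, fixed here).
    weights = torch.softmax(torch.as_tensor(log_weights, dtype=torch.float32), dim=0)
    estimates = [m(x_t, t) for m in models]
    return sum(w * s for w, s in zip(weights, estimates))
</code></pre>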
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:32:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39b8b672/788e2194.mp3" length="22484203" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1402</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Marta Skreta, Lazar Atanackovic, Avishek Joey Bose, Alexander Tong, Kirill Neklyudov</p>

            <p><strong>Title:</strong><br>
            The Superposition of Diffusion Models Using the Itô Density Estimator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17762v1">http://arxiv.org/abs/2412.17762v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. https://github.com/necludov/super-diffusion</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging</title>
      <itunes:episode>295</itunes:episode>
      <podcast:episode>295</podcast:episode>
      <itunes:title>Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6a001f08-6010-4e03-aa86-b3ccc67fc109</guid>
      <link>https://share.transistor.fm/s/0061f57c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee</p>

            <p><strong>Title:</strong><br>
            Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19512v1">http://arxiv.org/abs/2412.19512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee</p>

            <p><strong>Title:</strong><br>
            Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19512v1">http://arxiv.org/abs/2412.19512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.</p>
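
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of the weight-space merging idea: linearly interpolating, key by key, the state dicts of the safety-aligned model before and after fine-tuning. The helper name and the mixing coefficient alpha are assumptions for illustration; the paper evaluates several merging methods rather than prescribing this exact one.</p>

            <pre><code>def merge_state_dicts(pre_sd, post_sd, alpha=0.5):
    # Interpolate between the pre-fine-tuning (safety-aligned) weights and
    # the post-fine-tuning (task-adapted) weights; values are torch tensors.
    return {k: (1.0 - alpha) * pre_sd[k] + alpha * post_sd[k] for k in post_sd}

# Hypothetical usage: load two checkpoints of the same architecture and
# write the merged weights back into the model.
# model.load_state_dict(merge_state_dicts(pre_model.state_dict(), model.state_dict()))
</code></pre>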
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:32:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0061f57c/25cb12a8.mp3" length="18346406" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1143</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Hua Farn, Hsuan Su, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee</p>

            <p><strong>Title:</strong><br>
            Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.19512v1">http://arxiv.org/abs/2412.19512v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) for downstream tasks is a widely adopted approach, but it often leads to safety degradation in safety-aligned LLMs. Currently, many solutions address this issue by incorporating additional safety data, which can be impractical in many cases. In this paper, we address the question: How can we improve downstream task performance while preserving safety in LLMs without relying on additional safety data? We propose a simple and effective method that maintains the inherent safety of LLMs while enhancing their downstream task performance: merging the weights of pre- and post-fine-tuned safety-aligned models. Experimental results across various downstream tasks, models, and merging methods demonstrate that this approach effectively mitigates safety degradation while improving downstream task performance, offering a practical solution for adapting safety-aligned LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era</title>
      <itunes:episode>294</itunes:episode>
      <podcast:episode>294</podcast:episode>
      <itunes:title>CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">38549353-057f-40c4-ba7f-a3e2cef176f9</guid>
      <link>https://share.transistor.fm/s/1885530a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Yanlin Feng, Simone Papicchio, Sajjadur Rahman</p>

            <p><strong>Title:</strong><br>
            CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18702v1">http://arxiv.org/abs/2412.18702v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, the use of resource identifiers, overlapping relation types, and a lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Yanlin Feng, Simone Papicchio, Sajjadur Rahman</p>

            <p><strong>Title:</strong><br>
            CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18702v1">http://arxiv.org/abs/2412.18702v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, the use of resource identifiers, overlapping relation types, and a lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.</p>
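
            <p><strong>Illustrative sketch:</strong><br>
            A hypothetical example of the text-to-Cypher setting the benchmark evaluates: an LLM translates a natural-language question into a Cypher query, which is then executed against a property graph. The connection URI, credentials, labels, and schema below are made up for illustration and are not CypherBench's actual schema.</p>

            <pre><code>from neo4j import GraphDatabase

question = "Which films were directed by a person born after 1980?"
# Cypher that an LLM might produce for the (hypothetical) schema above.
cypher = (
    "MATCH (f:Film)-[:DIRECTED_BY]->(p:Person) "
    "WHERE p.birth_year > 1980 RETURN f.title AS title"
)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(cypher):
        print(record["title"])
</code></pre>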
            ]]>
      </content:encoded>
      <pubDate>Mon, 30 Dec 2024 20:31:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1885530a/612361ee.mp3" length="24065367" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1500</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL, cs.AI, cs.DB</p>

            <p><strong>Authors:</strong><br>
            Yanlin Feng, Simone Papicchio, Sajjadur Rahman</p>

            <p><strong>Title:</strong><br>
            CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18702v1">http://arxiv.org/abs/2412.18702v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, the use of resource identifiers, overlapping relation types, and a lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>YuLan-Mini: An Open Data-efficient Language Model</title>
      <itunes:episode>293</itunes:episode>
      <podcast:episode>293</podcast:episode>
      <itunes:title>YuLan-Mini: An Open Data-efficient Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d4c5332d-2141-4c49-a171-23a15c951224</guid>
      <link>https://share.transistor.fm/s/fa81c05a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            YuLan-Mini: An Open Data-efficient Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17743v2">http://arxiv.org/abs/2412.17743v2</a></p>

            <p><strong>Abstract:</strong><br>
            Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            YuLan-Mini: An Open Data-efficient Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17743v2">http://arxiv.org/abs/2412.17743v2</a></p>

            <p><strong>Abstract:</strong><br>
            Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Dec 2024 19:11:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa81c05a/880ccb02.mp3" length="18916006" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1179</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            YuLan-Mini: An Open Data-efficient Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17743v2">http://arxiv.org/abs/2412.17743v2</a></p>

            <p><strong>Abstract:</strong><br>
            Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression</title>
      <itunes:episode>292</itunes:episode>
      <podcast:episode>292</podcast:episode>
      <itunes:title>A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">70c92304-4aa5-49cc-b1d8-1fbb39bfc4d3</guid>
      <link>https://share.transistor.fm/s/562e0f65</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17483v1">http://arxiv.org/abs/2412.17483v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17483v1">http://arxiv.org/abs/2412.17483v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Dec 2024 19:10:58 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/562e0f65/4adcf82e.mp3" length="20978280" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17483v1">http://arxiv.org/abs/2412.17483v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MMFactory: A Universal Solution Search Engine for Vision-Language Tasks</title>
      <itunes:episode>291</itunes:episode>
      <podcast:episode>291</podcast:episode>
      <itunes:title>MMFactory: A Universal Solution Search Engine for Vision-Language Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f32a4d0b-7ad8-4d8d-8391-062e66468818</guid>
      <link>https://share.transistor.fm/s/d3dc537e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal</p>

            <p><strong>Title:</strong><br>
            MMFactory: A Universal Solution Search Engine for Vision-Language Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18072v1">http://arxiv.org/abs/2412.18072v1</a></p>

            <p><strong>Abstract:</strong><br>
            With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and, sometimes, require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. Project page is available at https://davidhalladay.github.io/mmfactory_demo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal</p>

            <p><strong>Title:</strong><br>
            MMFactory: A Universal Solution Search Engine for Vision-Language Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18072v1">http://arxiv.org/abs/2412.18072v1</a></p>

            <p><strong>Abstract:</strong><br>
            With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and, sometimes, require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. Project page is available at https://davidhalladay.github.io/mmfactory_demo.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Dec 2024 19:10:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3dc537e/3a87ab91.mp3" length="20384736" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal</p>

            <p><strong>Title:</strong><br>
            MMFactory: A Universal Solution Search Engine for Vision-Language Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18072v1">http://arxiv.org/abs/2412.18072v1</a></p>

            <p><strong>Abstract:</strong><br>
            With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and, sometimes, require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. Project page is available at https://davidhalladay.github.io/mmfactory_demo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation</title>
      <itunes:episode>290</itunes:episode>
      <podcast:episode>290</podcast:episode>
      <itunes:title>Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">78df760b-2086-498a-8414-a6942b198de8</guid>
      <link>https://share.transistor.fm/s/69f2136f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.IR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang</p>

            <p><strong>Title:</strong><br>
            Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18176v1">http://arxiv.org/abs/2412.18176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information, relying primarily on textual content data while neglecting other modalities, and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.IR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang</p>

            <p><strong>Title:</strong><br>
            Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18176v1">http://arxiv.org/abs/2412.18176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information, relying primarily on textual content data while neglecting other modalities, and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 27 Dec 2024 19:10:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/69f2136f/3a654673.mp3" length="21461846" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1338</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.IR, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yucong Luo, Qitao Qin, Hao Zhang, Mingyue Cheng, Ruiran Yan, Kefan Wang, Jie Ouyang</p>

            <p><strong>Title:</strong><br>
            Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18176v1">http://arxiv.org/abs/2412.18176v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information, relying primarily on textual content data while neglecting other modalities, and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DepthLab: From Partial to Complete</title>
      <itunes:episode>289</itunes:episode>
      <podcast:episode>289</podcast:episode>
      <itunes:title>DepthLab: From Partial to Complete</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d91404f2-e372-4351-b5ad-801e09133df8</guid>
      <link>https://share.transistor.fm/s/8b6fbb43</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo</p>

            <p><strong>Title:</strong><br>
            DepthLab: From Partial to Complete</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18153v1">http://arxiv.org/abs/2412.18153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at https://johanan528.github.io/depthlab_web/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo</p>

            <p><strong>Title:</strong><br>
            DepthLab: From Partial to Complete</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18153v1">http://arxiv.org/abs/2412.18153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at https://johanan528.github.io/depthlab_web/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:26:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8b6fbb43/5a2391a9.mp3" length="21296686" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1327</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo</p>

            <p><strong>Title:</strong><br>
            DepthLab: From Partial to Complete</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18153v1">http://arxiv.org/abs/2412.18153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at https://johanan528.github.io/depthlab_web/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization</title>
      <itunes:episode>288</itunes:episode>
      <podcast:episode>288</podcast:episode>
      <itunes:title>Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a25a8d2e-5055-4eaa-be85-826ca9aa30ef</guid>
      <link>https://share.transistor.fm/s/a5d86b42</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17739v1">http://arxiv.org/abs/2412.17739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier series and zeroes out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17739v1">http://arxiv.org/abs/2412.17739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier series and zeroes out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:26:36 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a5d86b42/97556b5d.mp3" length="20764266" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1294</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17739v1">http://arxiv.org/abs/2412.17739v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier series and zeroes out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation</title>
      <itunes:episode>287</itunes:episode>
      <podcast:episode>287</podcast:episode>
      <itunes:title>DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bdb4646e-72ac-47fe-819c-aca55b2a548d</guid>
      <link>https://share.transistor.fm/s/7d10110b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18597v1">http://arxiv.org/abs/2412.18597v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation, struggling to produce coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18597v1">http://arxiv.org/abs/2412.18597v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation, struggling to produce coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:26:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7d10110b/d2f6cf89.mp3" length="21390819" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV, cs.AI, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, Xiangyu Yue</p>

            <p><strong>Title:</strong><br>
            DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18597v1">http://arxiv.org/abs/2412.18597v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation, struggling to produce coherent scenes from multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method under MM-DiT architectures. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided, precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>In Case You Missed It: ARC 'Challenge' Is Not That Challenging</title>
      <itunes:episode>286</itunes:episode>
      <podcast:episode>286</podcast:episode>
      <itunes:title>In Case You Missed It: ARC 'Challenge' Is Not That Challenging</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3af7d7b9-195d-4eb6-8ec1-c6f660862422</guid>
      <link>https://share.transistor.fm/s/7ad98e62</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann</p>

            <p><strong>Title:</strong><br>
            In Case You Missed It: ARC 'Challenge' Is Not That Challenging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17758v1">http://arxiv.org/abs/2412.17758v1</a></p>

            <p><strong>Abstract:</strong><br>
            ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann</p>

            <p><strong>Title:</strong><br>
            In Case You Missed It: ARC 'Challenge' Is Not That Challenging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17758v1">http://arxiv.org/abs/2412.17758v1</a></p>

            <p><strong>Abstract:</strong><br>
            ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:25:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ad98e62/dd761299.mp3" length="23396959" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1459</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Łukasz Borchmann</p>

            <p><strong>Title:</strong><br>
            In Case You Missed It: ARC 'Challenge' Is Not That Challenging</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17758v1">http://arxiv.org/abs/2412.17758v1</a></p>

            <p><strong>Abstract:</strong><br>
            ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing</title>
      <itunes:episode>285</itunes:episode>
      <podcast:episode>285</podcast:episode>
      <itunes:title>ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e8c1fd1f-7746-46aa-bf08-a6d32f83f6d7</guid>
      <link>https://share.transistor.fm/s/95559b76</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ziteng Wang, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14711v1">http://arxiv.org/abs/2412.14711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ziteng Wang, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14711v1">http://arxiv.org/abs/2412.14711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:25:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95559b76/02cacb95.mp3" length="20156524" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1256</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ziteng Wang, Jianfei Chen, Jun Zhu</p>

            <p><strong>Title:</strong><br>
            ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14711v1">http://arxiv.org/abs/2412.14711v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router's sparsity while balancing the load among experts. ReMoE's continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval</title>
      <itunes:episode>284</itunes:episode>
      <podcast:episode>284</podcast:episode>
      <itunes:title>SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ba8d441-e5b5-4e32-b773-c6aa58739f99</guid>
      <link>https://share.transistor.fm/s/ee65a56d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary</p>

            <p><strong>Title:</strong><br>
            SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15443v1">http://arxiv.org/abs/2412.15443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision and context_recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, representing the highest performance across all evaluated metrics. These results highlight SKETCH's capability to deliver more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary</p>

            <p><strong>Title:</strong><br>
            SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15443v1">http://arxiv.org/abs/2412.15443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision and context_recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, representing the highest performance across all evaluated metrics. These results highlight SKETCH's capability to deliver more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:25:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ee65a56d/79b8bf89.mp3" length="21456392" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1337</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Aakash Mahalingam, Vinesh Kumar Gande, Aman Chadha, Vinija Jain, Divya Chaudhary</p>

            <p><strong>Title:</strong><br>
            SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15443v1">http://arxiv.org/abs/2412.15443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer_relevancy, faithfulness, context_precision and context_recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, representing the highest performance across all evaluated metrics. These results highlight SKETCH's capability to deliver more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models</title>
      <itunes:episode>283</itunes:episode>
      <podcast:episode>283</podcast:episode>
      <itunes:title>PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c596592e-0257-4ef5-8647-3607f5dd3e83</guid>
      <link>https://share.transistor.fm/s/d20db665</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi</p>

            <p><strong>Title:</strong><br>
            PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18608v1">http://arxiv.org/abs/2412.18608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi</p>

            <p><strong>Title:</strong><br>
            PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18608v1">http://arxiv.org/abs/2412.18608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:24:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d20db665/ecbcc153.mp3" length="25396499" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1584</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi</p>

            <p><strong>Title:</strong><br>
            PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.18608v1">http://arxiv.org/abs/2412.18608v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MotiF: Making Text Count in Image Animation with Motion Focal Loss</title>
      <itunes:episode>282</itunes:episode>
      <podcast:episode>282</podcast:episode>
      <itunes:title>MotiF: Making Text Count in Image Animation with Motion Focal Loss</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a262434-49b5-4078-9b15-249aeefd42f0</guid>
      <link>https://share.transistor.fm/s/f0de620f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin</p>

            <p><strong>Title:</strong><br>
            MotiF: Making Text Count in Image Animation with Motion Focal Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16153v1">http://arxiv.org/abs/2412.16153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks annotators to select an overall preference between two videos, followed by their justifications. Through a comprehensive evaluation on TI2V Bench, MotiF outperforms nine open-source models, achieving an average preference of 72%. TI2V Bench is released at https://wang-sj16.github.io/motif/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin</p>

            <p><strong>Title:</strong><br>
            MotiF: Making Text Count in Image Animation with Motion Focal Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16153v1">http://arxiv.org/abs/2412.16153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks annotators to select an overall preference between two videos, followed by their justifications. Through a comprehensive evaluation on TI2V Bench, MotiF outperforms nine open-source models, achieving an average preference of 72%. TI2V Bench is released at https://wang-sj16.github.io/motif/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:24:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f0de620f/ccda3e57.mp3" length="21730978" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1355</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin</p>

            <p><strong>Title:</strong><br>
            MotiF: Making Text Count in Image Animation with Motion Focal Loss</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16153v1">http://arxiv.org/abs/2412.16153v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs for robust evaluation. We present a human evaluation protocol that asks annotators to select an overall preference between two videos, followed by their justifications. Through a comprehensive evaluation on TI2V Bench, MotiF outperforms nine open-source models, achieving an average preference of 72%. TI2V Bench is released at https://wang-sj16.github.io/motif/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Bridging the Data Provenance Gap Across Text, Speech and Video</title>
      <itunes:episode>281</itunes:episode>
      <podcast:episode>281</podcast:episode>
      <itunes:title>Bridging the Data Provenance Gap Across Text, Speech and Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">38166471-6e9c-4652-af3f-05d9bdbfde8c</guid>
      <link>https://share.transistor.fm/s/906b869f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.AI, cs.CL, cs.CY, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara</p>

            <p><strong>Title:</strong><br>
            Bridging the Data Provenance Gap Across Text, Speech and Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17847v1">http://arxiv.org/abs/2412.17847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.AI, cs.CL, cs.CY, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara</p>

            <p><strong>Title:</strong><br>
            Bridging the Data Provenance Gap Across Text, Speech and Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17847v1">http://arxiv.org/abs/2412.17847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 25 Dec 2024 20:24:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/906b869f/c4db3e9d.mp3" length="24518344" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1529</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.AI, cs.CL, cs.CY, cs.LG, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara</p>

            <p><strong>Title:</strong><br>
            Bridging the Data Provenance Gap Across Text, Speech and Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17847v1">http://arxiv.org/abs/2412.17847v1</a></p>

            <p><strong>Abstract:</strong><br>
            Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response</title>
      <itunes:episode>280</itunes:episode>
      <podcast:episode>280</podcast:episode>
      <itunes:title>RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03da0c3f-79ab-4668-94ab-70b63e574bc1</guid>
      <link>https://share.transistor.fm/s/5ee76dfe</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang</p>

            <p><strong>Title:</strong><br>
            RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14922v1">http://arxiv.org/abs/2412.14922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT's exceptional performance in noisy scenarios.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang</p>

            <p><strong>Title:</strong><br>
            RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14922v1">http://arxiv.org/abs/2412.14922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT's exceptional performance in noisy scenarios.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:44:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5ee76dfe/102d6deb.mp3" length="20839073" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1299</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 64 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang</p>

            <p><strong>Title:</strong><br>
            RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14922v1">http://arxiv.org/abs/2412.14922v1</a></p>

            <p><strong>Abstract:</strong><br>
            Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT's exceptional performance in noisy scenarios.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners</title>
      <itunes:episode>279</itunes:episode>
      <podcast:episode>279</podcast:episode>
      <itunes:title>B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9156c73f-c38d-422e-84d8-833ef4ad9e74</guid>
      <link>https://share.transistor.fm/s/3c45aca1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He</p>

            <p><strong>Title:</strong><br>
            B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17256v1">http://arxiv.org/abs/2412.17256v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors underlying the mechanism of these iterative self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what the bottlenecks are in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model's ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He</p>

            <p><strong>Title:</strong><br>
            B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17256v1">http://arxiv.org/abs/2412.17256v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors underlying the mechanism of these iterative self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what the bottlenecks are in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model's ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:44:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3c45aca1/b79e58f2.mp3" length="19919981" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1241</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Weihao Zeng, Yuzhen Huang, Lulu Zhao, Yijun Wang, Zifei Shan, Junxian He</p>

            <p><strong>Title:</strong><br>
            B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17256v1">http://arxiv.org/abs/2412.17256v1</a></p>

            <p><strong>Abstract:</strong><br>
            In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors underlying the mechanism of these iterative self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what the bottlenecks are in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model's ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching</title>
      <itunes:episode>278</itunes:episode>
      <podcast:episode>278</podcast:episode>
      <itunes:title>Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">53a391e9-21b1-483b-b676-90d16bf1a5a6</guid>
      <link>https://share.transistor.fm/s/6e94c509</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin</p>

            <p><strong>Title:</strong><br>
            Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17153v2">http://arxiv.org/abs/2412.17153v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID&gt;100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin</p>

            <p><strong>Title:</strong><br>
            Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17153v2">http://arxiv.org/abs/2412.17153v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID&gt;100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:44:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6e94c509/cb6d8f38.mp3" length="23095639" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1440</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 26 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin</p>

            <p><strong>Title:</strong><br>
            Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17153v2">http://arxiv.org/abs/2412.17153v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from a Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3× speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a 217.8× speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID&gt;100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Diving into Self-Evolving Training for Multimodal Reasoning</title>
      <itunes:episode>277</itunes:episode>
      <podcast:episode>277</podcast:episode>
      <itunes:title>Diving into Self-Evolving Training for Multimodal Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">950a7dd2-3bd7-4d2a-be85-dfd8bb8d0cd2</guid>
      <link>https://share.transistor.fm/s/9eecd8e5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He</p>

            <p><strong>Title:</strong><br>
            Diving into Self-Evolving Training for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17451v1">http://arxiv.org/abs/2412.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models of different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He</p>

            <p><strong>Title:</strong><br>
            Diving into Self-Evolving Training for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17451v1">http://arxiv.org/abs/2412.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models of different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:43:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9eecd8e5/a80f81b1.mp3" length="20348780" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1268</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He</p>

            <p><strong>Title:</strong><br>
            Diving into Self-Evolving Training for Multimodal Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17451v1">http://arxiv.org/abs/2412.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models of different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Deliberation in Latent Space via Differentiable Cache Augmentation</title>
      <itunes:episode>276</itunes:episode>
      <podcast:episode>276</podcast:episode>
      <itunes:title>Deliberation in Latent Space via Differentiable Cache Augmentation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">df36ba78-57de-4c65-9b58-1af241dac99d</guid>
      <link>https://share.transistor.fm/s/1ec6c6a2</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam</p>

            <p><strong>Title:</strong><br>
            Deliberation in Latent Space via Differentiable Cache Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17747v1">http://arxiv.org/abs/2412.17747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam</p>

            <p><strong>Title:</strong><br>
            Deliberation in Latent Space via Differentiable Cache Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17747v1">http://arxiv.org/abs/2412.17747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:43:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1ec6c6a2/256108e4.mp3" length="21630250" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1348</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam</p>

            <p><strong>Title:</strong><br>
            Deliberation in Latent Space via Differentiable Cache Augmentation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17747v1">http://arxiv.org/abs/2412.17747v1</a></p>

            <p><strong>Abstract:</strong><br>
            Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Motion Video Autoencoding with Cross-modal Video VAE</title>
      <itunes:episode>275</itunes:episode>
      <podcast:episode>275</podcast:episode>
      <itunes:title>Large Motion Video Autoencoding with Cross-modal Video VAE</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cb1eb917-e040-4a36-9011-3b40ff514e15</guid>
      <link>https://share.transistor.fm/s/b5cec89d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Large Motion Video Autoencoding with Cross-modal Video VAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17805v1">http://arxiv.org/abs/2412.17805v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Large Motion Video Autoencoding with Cross-modal Video VAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17805v1">http://arxiv.org/abs/2412.17805v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:43:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b5cec89d/d6524f44.mp3" length="24179793" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1508</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen</p>

            <p><strong>Title:</strong><br>
            Large Motion Video Autoencoding with Cross-modal Video VAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.17805v1">http://arxiv.org/abs/2412.17805v1</a></p>

            <p><strong>Abstract:</strong><br>
            Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenAI o1 System Card</title>
      <itunes:episode>274</itunes:episode>
      <podcast:episode>274</podcast:episode>
      <itunes:title>OpenAI o1 System Card</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8a93be7f-51d2-4114-b4bf-58203876afe0</guid>
      <link>https://share.transistor.fm/s/ed54079e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Keren GuLemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna 
Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li</p>

            <p><strong>Title:</strong><br>
            OpenAI o1 System Card</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16720v1">http://arxiv.org/abs/2412.16720v1</a></p>

            <p><strong>Abstract:</strong><br>
            The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Keren GuLemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna 
Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li</p>

            <p><strong>Title:</strong><br>
            OpenAI o1 System Card</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16720v1">http://arxiv.org/abs/2412.16720v1</a></p>

            <p><strong>Abstract:</strong><br>
            The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:42:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ed54079e/c5df1ce0.mp3" length="24070668" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1501</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li</p>

            <p><strong>Title:</strong><br>
            OpenAI o1 System Card</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16720v1">http://arxiv.org/abs/2412.16720v1</a></p>

            <p><strong>Abstract:</strong><br>
            The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Revisiting In-Context Learning with Long Context Language Models</title>
      <itunes:episode>273</itunes:episode>
      <podcast:episode>273</podcast:episode>
      <itunes:title>Revisiting In-Context Learning with Long Context Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">86c6c4bb-de02-44cc-89dd-45662fe09483</guid>
      <link>https://share.transistor.fm/s/cbbea86c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar</p>

            <p><strong>Title:</strong><br>
            Revisiting In-Context Learning with Long Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16926v1">http://arxiv.org/abs/2412.16926v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar</p>

            <p><strong>Title:</strong><br>
            Revisiting In-Context Learning with Long Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16926v1">http://arxiv.org/abs/2412.16926v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:42:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cbbea86c/6f1810bf.mp3" length="22786741" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1420</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jinheon Baek, Sun Jae Lee, Prakhar Gupta, Geunseob Oh, Siddharth Dalmia, Prateek Kolhar</p>

            <p><strong>Title:</strong><br>
            Revisiting In-Context Learning with Long Context Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16926v1">http://arxiv.org/abs/2412.16926v1</a></p>

            <p><strong>Abstract:</strong><br>
            In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we find that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Outcome-Refining Process Supervision for Code Generation</title>
      <itunes:episode>272</itunes:episode>
      <podcast:episode>272</podcast:episode>
      <itunes:title>Outcome-Refining Process Supervision for Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a9bb8aad-5b41-4c61-aa81-2bcf32dfec79</guid>
      <link>https://share.transistor.fm/s/a9202673</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            Outcome-Refining Process Supervision for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15118v1">http://arxiv.org/abs/2412.15118v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, and creates more reliable verification than traditional reward models without requiring the training of PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average of 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            Outcome-Refining Process Supervision for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15118v1">http://arxiv.org/abs/2412.15118v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, and creates more reliable verification than traditional reward models without requiring the training of PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average of 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:42:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9202673/c6ed29aa.mp3" length="20416486" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1272</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            Outcome-Refining Process Supervision for Code Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15118v1">http://arxiv.org/abs/2412.15118v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, and creates more reliable verification than traditional reward models without requiring the training of PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average of 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LearnLM: Improving Gemini for Learning</title>
      <itunes:episode>271</itunes:episode>
      <podcast:episode>271</podcast:episode>
      <itunes:title>LearnLM: Improving Gemini for Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">611db8d9-bfeb-40ae-8cae-b076e32ed301</guid>
      <link>https://share.transistor.fm/s/66016cbf</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Stephanie Chan, Steve Yadlowsky, Viknesh Sounderajah, Yannis Assael</p>

            <p><strong>Title:</strong><br>
            LearnLM: Improving Gemini for Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16429v1">http://arxiv.org/abs/2412.16429v1</a></p>

            <p><strong>Abstract:</strong><br>
            Today's generative AI systems are tuned to present information by default rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of <em>pedagogical instruction following</em>, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning -- by enabling the addition of our pedagogical data to post-training mixtures -- alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that is preferred substantially by expert raters across a diverse set of learning scenarios, with average preference strengths of 31% over GPT-4o, 11% over Claude 3.5, and 13% over the Gemini 1.5 Pro model LearnLM was based on.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Stephanie Chan, Steve Yadlowsky, Viknesh Sounderajah, Yannis Assael</p>

            <p><strong>Title:</strong><br>
            LearnLM: Improving Gemini for Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16429v1">http://arxiv.org/abs/2412.16429v1</a></p>

            <p><strong>Abstract:</strong><br>
            Today's generative AI systems are tuned to present information by default rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of <em>pedagogical instruction following</em>, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning -- by enabling the addition of our pedagogical data to post-training mixtures -- alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that is preferred substantially by expert raters across a diverse set of learning scenarios, with average preference strengths of 31% over GPT-4o, 11% over Claude 3.5, and 13% over the Gemini 1.5 Pro model LearnLM was based on.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 24 Dec 2024 20:41:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/66016cbf/54f4cf86.mp3" length="26273330" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1638</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Stephanie Chan, Steve Yadlowsky, Viknesh Sounderajah, Yannis Assael</p>

            <p><strong>Title:</strong><br>
            LearnLM: Improving Gemini for Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16429v1">http://arxiv.org/abs/2412.16429v1</a></p>

            <p><strong>Abstract:</strong><br>
            Today's generative AI systems are tuned to present information by default rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of <em>pedagogical instruction following</em>, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning -- by enabling the addition of our pedagogical data to post-training mixtures -- alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that is preferred substantially by expert raters across a diverse set of learning scenarios, with average preference strengths of 31% over GPT-4o, 11% over Claude 3.5, and 13% over the Gemini 1.5 Pro model LearnLM was based on.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Parallelized Autoregressive Visual Generation</title>
      <itunes:episode>270</itunes:episode>
      <podcast:episode>270</podcast:episode>
      <itunes:title>Parallelized Autoregressive Visual Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8fed7a86-cead-4cbb-8039-b5fcdf2a0f1f</guid>
      <link>https://share.transistor.fm/s/97395f9a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Parallelized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15119v1">http://arxiv.org/abs/2412.15119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies: tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: https://epiphqny.github.io/PAR-project.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Parallelized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15119v1">http://arxiv.org/abs/2412.15119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies: tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: https://epiphqny.github.io/PAR-project.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:38:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/97395f9a/312d010c.mp3" length="21698356" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 34 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Parallelized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15119v1">http://arxiv.org/abs/2412.15119v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies: tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: https://epiphqny.github.io/PAR-project.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Offline Reinforcement Learning for LLM Multi-Step Reasoning</title>
      <itunes:episode>269</itunes:episode>
      <podcast:episode>269</podcast:episode>
      <itunes:title>Offline Reinforcement Learning for LLM Multi-Step Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">11392876-e5b4-4bd4-907a-675b9f850cb3</guid>
      <link>https://share.transistor.fm/s/2184a5db</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Offline Reinforcement Learning for LLM Multi-Step Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16145v1">http://arxiv.org/abs/2412.16145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Offline Reinforcement Learning for LLM Multi-Step Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16145v1">http://arxiv.org/abs/2412.16145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:38:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2184a5db/422adb0c.mp3" length="20196643" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1259</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu</p>

            <p><strong>Title:</strong><br>
            Offline Reinforcement Learning for LLM Multi-Step Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16145v1">http://arxiv.org/abs/2412.16145v1</a></p>

            <p><strong>Abstract:</strong><br>
            Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation</title>
      <itunes:episode>268</itunes:episode>
      <podcast:episode>268</podcast:episode>
      <itunes:title>SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2a6722f6-ad29-4393-ac64-859f7a74832d</guid>
      <link>https://share.transistor.fm/s/0f1c8635</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou</p>

            <p><strong>Title:</strong><br>
            SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13649v1">http://arxiv.org/abs/2412.13649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks based on the following two observations: (i) Excessive compression during the prefill phase, which requires specific full context, impairs the comprehension of the reasoning task; (ii) Deviation of heavy hitters occurs in the reasoning tasks with long outputs. Therefore, SCOPE, a simple yet efficient framework that separately performs KV cache optimization during the prefill and decoding phases, is introduced. Specifically, the KV cache during the prefill phase is preserved to maintain the essential information, while a novel strategy based on sliding is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE and its compatibility as a plug-in to other prefill-only KV compression methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou</p>

            <p><strong>Title:</strong><br>
            SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13649v1">http://arxiv.org/abs/2412.13649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks based on the following two observations: (i) Excessive compression during the prefill phase, which requires specific full context, impairs the comprehension of the reasoning task; (ii) Deviation of heavy hitters occurs in the reasoning tasks with long outputs. Therefore, SCOPE, a simple yet efficient framework that separately performs KV cache optimization during the prefill and decoding phases, is introduced. Specifically, the KV cache during the prefill phase is preserved to maintain the essential information, while a novel strategy based on sliding is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE and its compatibility as a plug-in to other prefill-only KV compression methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:38:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0f1c8635/9497cecc.mp3" length="21170919" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1320</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou</p>

            <p><strong>Title:</strong><br>
            SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13649v1">http://arxiv.org/abs/2412.13649v1</a></p>

            <p><strong>Abstract:</strong><br>
            Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks based on the following two observations: (i) Excessive compression during the prefill phase, which requires specific full context, impairs the comprehension of the reasoning task; (ii) Deviation of heavy hitters occurs in the reasoning tasks with long outputs. Therefore, SCOPE, a simple yet efficient framework that separately performs KV cache optimization during the prefill and decoding phases, is introduced. Specifically, the KV cache during the prefill phase is preserved to maintain the essential information, while a novel strategy based on sliding is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE and its compatibility as a plug-in to other prefill-only KV compression methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up</title>
      <itunes:episode>267</itunes:episode>
      <podcast:episode>267</podcast:episode>
      <itunes:title>CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">bc1c1a86-3a57-4490-b6ae-75a6f18c4dcb</guid>
      <link>https://share.transistor.fm/s/1ad89220</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songhua Liu, Zhenxiong Tan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16112v1">http://arxiv.org/abs/2412.16112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: https://github.com/Huage001/CLEAR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songhua Liu, Zhenxiong Tan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16112v1">http://arxiv.org/abs/2412.16112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: https://github.com/Huage001/CLEAR.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:37:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1ad89220/c73f91c5.mp3" length="24433091" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1523</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Songhua Liu, Zhenxiong Tan, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.16112v1">http://arxiv.org/abs/2412.16112v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: https://github.com/Huage001/CLEAR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis</title>
      <itunes:episode>266</itunes:episode>
      <podcast:episode>266</podcast:episode>
      <itunes:title>Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e8ab0b0e-f7cd-42a0-9c52-0226c45278a3</guid>
      <link>https://share.transistor.fm/s/18870e78</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.LG, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji</p>

            <p><strong>Title:</strong><br>
            Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15322v1">http://arxiv.org/abs/2412.15322v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.LG, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji</p>

            <p><strong>Title:</strong><br>
            Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15322v1">http://arxiv.org/abs/2412.15322v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:37:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18870e78/e597fe8a.mp3" length="22504628" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV, cs.LG, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji</p>

            <p><strong>Title:</strong><br>
            Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15322v1">http://arxiv.org/abs/2412.15322v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: https://hkchengrex.github.io/MMAudio</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage</title>
      <itunes:episode>265</itunes:episode>
      <podcast:episode>265</podcast:episode>
      <itunes:title>Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8781cb96-a52d-4b4b-a23c-3f0159d4a606</guid>
      <link>https://share.transistor.fm/s/df9b6dbb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15484v1">http://arxiv.org/abs/2412.15484v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct a given caption. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches for improving MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15484v1">http://arxiv.org/abs/2412.15484v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct a given caption. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches for improving MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:37:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/df9b6dbb/3a57241e.mp3" length="26993560" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1683</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15484v1">http://arxiv.org/abs/2412.15484v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct a given caption. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches for improving MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sequence Matters: Harnessing Video Models in 3D Super-Resolution</title>
      <itunes:episode>264</itunes:episode>
      <podcast:episode>264</podcast:episode>
      <itunes:title>Sequence Matters: Harnessing Video Models in 3D Super-Resolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">254be7d8-dad4-4527-9719-af2d003c9cc1</guid>
      <link>https://share.transistor.fm/s/9fdbfd82</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV, 68U10, 68T10, I.4.5; I.2.10</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Sequence Matters: Harnessing Video Models in 3D Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11525v3">http://arxiv.org/abs/2412.11525v3</a></p>

            <p><strong>Abstract:</strong><br>
            3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without fine-tuning or generating a 'smooth' trajectory from the 3D models trained over the LR images. The experimental results show that these surprisingly simple algorithms achieve state-of-the-art results on 3D super-resolution tasks with standard benchmark datasets, such as the NeRF-synthetic and MipNeRF-360 datasets. Project page: https://ko-lani.github.io/Sequence-Matters</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV, 68U10, 68T10, I.4.5; I.2.10</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Sequence Matters: Harnessing Video Models in 3D Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11525v3">http://arxiv.org/abs/2412.11525v3</a></p>

            <p><strong>Abstract:</strong><br>
            3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without fine-tuning or generating a 'smooth' trajectory from the 3D models trained over the LR images. The experimental results show that these surprisingly simple algorithms achieve state-of-the-art results on 3D super-resolution tasks with standard benchmark datasets, such as the NeRF-synthetic and MipNeRF-360 datasets. Project page: https://ko-lani.github.io/Sequence-Matters</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:36:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9fdbfd82/a5869ede.mp3" length="21339348" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 6 | cs.CV, 68U10, 68T10, I.4.5; I.2.10</p>

            <p><strong>Authors:</strong><br>
            Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Sequence Matters: Harnessing Video Models in 3D Super-Resolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11525v3">http://arxiv.org/abs/2412.11525v3</a></p>

            <p><strong>Abstract:</strong><br>
            3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without fine-tuning or generating a 'smooth' trajectory from the 3D models trained over the LR images. The experimental results show that these surprisingly simple algorithms achieve state-of-the-art results on 3D super-resolution tasks with standard benchmark datasets, such as the NeRF-synthetic and MipNeRF-360 datasets. Project page: https://ko-lani.github.io/Sequence-Matters</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TRecViT: A Recurrent Video Transformer</title>
      <itunes:episode>263</itunes:episode>
      <podcast:episode>263</podcast:episode>
      <itunes:title>TRecViT: A Recurrent Video Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">627b1809-db81-4d63-ba00-3bc4973e9bb8</guid>
      <link>https://share.transistor.fm/s/d47f6297</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu</p>

            <p><strong>Title:</strong><br>
            TRecViT: A Recurrent Video Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14294v1">http://arxiv.org/abs/2412.14294v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOP count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu</p>

            <p><strong>Title:</strong><br>
            TRecViT: A Recurrent Video Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14294v1">http://arxiv.org/abs/2412.14294v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOP count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:36:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d47f6297/d50f0eb5.mp3" length="24127528" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu</p>

            <p><strong>Title:</strong><br>
            TRecViT: A Recurrent Video Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14294v1">http://arxiv.org/abs/2412.14294v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOP count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design</title>
      <itunes:episode>262</itunes:episode>
      <podcast:episode>262</podcast:episode>
      <itunes:title>MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fb81010b-2fec-4644-984d-c38d4c8fa9dc</guid>
      <link>https://share.transistor.fm/s/b837d9ee</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhen Zheng, Xiaonan Song, Chuanjie Liu</p>

            <p><strong>Title:</strong><br>
            MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14590v1">http://arxiv.org/abs/2412.14590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. However, existing quantization solutions still show limitations, with either a non-negligible accuracy drop or system inefficiency. In this paper, we present a comprehensive analysis of general quantization principles and their effect on the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience from a global view rather than within each single layer, effectively assigning a larger bit-width to the output features that need it most to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration from algorithm-system co-design that leads to high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of the int8 Tensor Core and fast data type conversion to significantly reduce dequantization overhead, and present a software pipeline that overlaps memory access, dequantization, and MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 for the SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro improves by 0.93 on average over the SOTA of three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhen Zheng, Xiaonan Song, Chuanjie Liu</p>

            <p><strong>Title:</strong><br>
            MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14590v1">http://arxiv.org/abs/2412.14590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. However, existing quantization solutions still show limitations, with either a non-negligible accuracy drop or system inefficiency. In this paper, we present a comprehensive analysis of general quantization principles and their effect on the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience from a global view rather than within each single layer, effectively assigning a larger bit-width to the output features that need it most to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration from algorithm-system co-design that leads to high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of the int8 Tensor Core and fast data type conversion to significantly reduce dequantization overhead, and present a software pipeline that overlaps memory access, dequantization, and MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 for the SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro improves by 0.93 on average over the SOTA of three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:36:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b837d9ee/8c92b0e2.mp3" length="22150654" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1381</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhen Zheng, Xiaonan Song, Chuanjie Liu</p>

            <p><strong>Title:</strong><br>
            MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14590v1">http://arxiv.org/abs/2412.14590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. However, existing quantization solutions still show limitations, with either a non-negligible accuracy drop or system inefficiency. In this paper, we present a comprehensive analysis of general quantization principles and their effect on the triangle of accuracy, memory consumption, and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience from a global view rather than within each single layer, effectively assigning a larger bit-width to the output features that need it most to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration from algorithm-system co-design that leads to high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of the int8 Tensor Core and fast data type conversion to significantly reduce dequantization overhead, and present a software pipeline that overlaps memory access, dequantization, and MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 for the SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro improves by 0.93 on average over the SOTA of three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multi-LLM Text Summarization</title>
      <itunes:episode>261</itunes:episode>
      <podcast:episode>261</podcast:episode>
      <itunes:title>Multi-LLM Text Summarization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed0d96a2-278a-445e-be66-5ddec2b09892</guid>
      <link>https://share.transistor.fm/s/48794cfc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangnan Fang, Cheng-Tse Liu, Jieun Kim, Yash Bhedaru, Ethan Liu, Nikhil Singh, Nedim Lipka, Puneet Mathur, Nesreen K. Ahmed, Franck Dernoncourt, Ryan A. Rossi, Hanieh Deilamsalehy</p>

            <p><strong>Title:</strong><br>
            Multi-LLM Text Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15487v1">http://arxiv.org/abs/2412.15487v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose a multi-LLM summarization framework and investigate two multi-LLM strategies: centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation: generation and evaluation. These steps differ depending on whether the decentralized or the centralized strategy is used. In both strategies, k different LLMs generate diverse summaries of the text. During evaluation, however, the centralized approach leverages a single LLM to evaluate the summaries and select the best one, whereas k LLMs are used in the decentralized approach. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that leverage only a single LLM, by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangnan Fang, Cheng-Tse Liu, Jieun Kim, Yash Bhedaru, Ethan Liu, Nikhil Singh, Nedim Lipka, Puneet Mathur, Nesreen K. Ahmed, Franck Dernoncourt, Ryan A. Rossi, Hanieh Deilamsalehy</p>

            <p><strong>Title:</strong><br>
            Multi-LLM Text Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15487v1">http://arxiv.org/abs/2412.15487v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose a multi-LLM summarization framework and investigate two multi-LLM strategies: centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation: generation and evaluation. These steps differ depending on whether the decentralized or the centralized strategy is used. In both strategies, k different LLMs generate diverse summaries of the text. During evaluation, however, the centralized approach leverages a single LLM to evaluate the summaries and select the best one, whereas k LLMs are used in the decentralized approach. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that leverage only a single LLM, by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 23 Dec 2024 20:35:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/48794cfc/c56517ee.mp3" length="22439381" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiangnan Fang, Cheng-Tse Liu, Jieun Kim, Yash Bhedaru, Ethan Liu, Nikhil Singh, Nedim Lipka, Puneet Mathur, Nesreen K. Ahmed, Franck Dernoncourt, Ryan A. Rossi, Hanieh Deilamsalehy</p>

            <p><strong>Title:</strong><br>
            Multi-LLM Text Summarization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15487v1">http://arxiv.org/abs/2412.15487v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we propose a multi-LLM summarization framework and investigate two multi-LLM strategies: centralized and decentralized. Our multi-LLM summarization framework has two fundamentally important steps at each round of conversation: generation and evaluation. These steps differ depending on whether the decentralized or the centralized strategy is used. In both strategies, k different LLMs generate diverse summaries of the text. During evaluation, however, the centralized approach leverages a single LLM to evaluate the summaries and select the best one, whereas k LLMs are used in the decentralized approach. Overall, we find that our multi-LLM summarization approaches significantly outperform baselines that leverage only a single LLM, by up to 3x. These results indicate the effectiveness of multi-LLM approaches for summarization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Qwen2.5 Technical Report</title>
      <itunes:episode>260</itunes:episode>
      <podcast:episode>260</podcast:episode>
      <itunes:title>Qwen2.5 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f481e26a-94a8-492e-8608-d9e6e8ae9326</guid>
      <link>https://share.transistor.fm/s/326bbf6f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 236 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15115v1">http://arxiv.org/abs/2412.15115v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance alignment with human preferences and notably improve long-text generation, structured data analysis, and instruction following. To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in a rich range of sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates performance competitive with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as foundations, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 236 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15115v1">http://arxiv.org/abs/2412.15115v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance alignment with human preferences and notably improve long-text generation, structured data analysis, and instruction following. To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in a rich range of sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates performance competitive with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as foundations, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:31:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/326bbf6f/6bf9bee5.mp3" length="24547563" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1531</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 236 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu</p>

            <p><strong>Title:</strong><br>
            Qwen2.5 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15115v1">http://arxiv.org/abs/2412.15115v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance alignment with human preferences and notably improve long-text generation, structured data analysis, and instruction following. To handle diverse and varied use cases effectively, we present the Qwen2.5 LLM series in a rich range of sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates performance competitive with the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o, respectively. Additionally, as foundations, the Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</title>
      <itunes:episode>259</itunes:episode>
      <podcast:episode>259</podcast:episode>
      <itunes:title>MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">03fd588c-30aa-49eb-83cf-a2d273066bf5</guid>
      <link>https://share.transistor.fm/s/62d96e52</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong</p>

            <p><strong>Title:</strong><br>
            MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14475v1">http://arxiv.org/abs/2412.14475v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong</p>

            <p><strong>Title:</strong><br>
            MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14475v1">http://arxiv.org/abs/2412.14475v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:30:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/62d96e52/8c7f8530.mp3" length="22171509" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1382</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 44 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong</p>

            <p><strong>Title:</strong><br>
            MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14475v1">http://arxiv.org/abs/2412.14475v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70$\times$ more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks</title>
      <itunes:episode>258</itunes:episode>
      <podcast:episode>258</podcast:episode>
      <itunes:title>LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a50b7c59-b9f4-4f5b-ba9a-775f374276ab</guid>
      <link>https://share.transistor.fm/s/e9c66199</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15204v1">http://arxiv.org/abs/2412.15204v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when it directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15204v1">http://arxiv.org/abs/2412.15204v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when it directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:30:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e9c66199/908b45f4.mp3" length="22313222" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1391</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15204v1">http://arxiv.org/abs/2412.15204v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when it directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at https://longbench2.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How to Synthesize Text Data without Model Collapse?</title>
      <itunes:episode>257</itunes:episode>
      <podcast:episode>257</podcast:episode>
      <itunes:title>How to Synthesize Text Data without Model Collapse?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">27a0ec4a-5e0f-43d4-bf3f-b5b41b5d7490</guid>
      <link>https://share.transistor.fm/s/2ed45d79</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            How to Synthesize Text Data without Model Collapse?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14689v1">http://arxiv.org/abs/2412.14689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Model collapse in synthetic data refers to the phenomenon whereby iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover the distributional shift phenomenon and the over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            How to Synthesize Text Data without Model Collapse?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14689v1">http://arxiv.org/abs/2412.14689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Model collapse in synthetic data refers to the phenomenon whereby iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can we synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by these findings, we propose token-level editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:29:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2ed45d79/5134fcdd.mp3" length="23419936" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1460</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou</p>

            <p><strong>Title:</strong><br>
            How to Synthesize Text Data without Model Collapse?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14689v1">http://arxiv.org/abs/2412.14689v1</a></p>

            <p><strong>Abstract:</strong><br>
            Model collapse in synthetic data refers to the phenomenon whereby iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can we synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by these findings, we propose token-level editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Flowing from Words to Pixels: A Framework for Cross-Modality Evolution</title>
      <itunes:episode>256</itunes:episode>
      <podcast:episode>256</podcast:episode>
      <itunes:title>Flowing from Words to Pixels: A Framework for Cross-Modality Evolution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b748a62-bcaf-43e7-a1fc-ac3b29a970bb</guid>
      <link>https://share.transistor.fm/s/f2520766</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh</p>

            <p><strong>Title:</strong><br>
            Flowing from Words to Pixels: A Framework for Cross-Modality Evolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15213v1">http://arxiv.org/abs/2412.15213v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.</p>
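
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of a flow-matching training step in which the source samples come from a text latent rather than Gaussian noise, so the model regresses the straight-line velocity from text latents to image latents with no noise distribution and no cross-attention conditioning. The <code>text_encoder</code>, <code>image_encoder</code>, and <code>velocity_net</code> modules and all dimensions are toy placeholders; CrossFlow's actual variational encoders and transformer are not reproduced here.</p>

            <pre><code>
import torch
import torch.nn as nn

# Toy stand-ins for encoders mapping text and images into a shared latent space.
dim = 64
text_encoder = nn.Linear(128, dim)   # placeholder "text latent" encoder
image_encoder = nn.Linear(256, dim)  # placeholder "image latent" encoder
velocity_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.GELU(), nn.Linear(256, dim))

def flow_matching_step(text_feats, image_feats):
    """One conditional flow-matching step that regresses the straight-line
    velocity from the text latent (source) to the image latent (target)."""
    z0 = text_encoder(text_feats)      # source sample: text modality
    z1 = image_encoder(image_feats)    # target sample: image modality
    t = torch.rand(z0.size(0), 1)      # random interpolation time in [0, 1]
    xt = (1 - t) * z0 + t * z1         # point on the straight path
    target_v = z1 - z0                 # constant velocity of that path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return nn.functional.mse_loss(pred_v, target_v)

loss = flow_matching_step(torch.randn(8, 128), torch.randn(8, 256))
loss.backward()
</code></pre>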
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh</p>

            <p><strong>Title:</strong><br>
            Flowing from Words to Pixels: A Framework for Cross-Modality Evolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15213v1">http://arxiv.org/abs/2412.15213v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:29:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f2520766/626ccd70.mp3" length="19213614" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1197</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh</p>

            <p><strong>Title:</strong><br>
            Flowing from Words to Pixels: A Framework for Cross-Modality Evolution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15213v1">http://arxiv.org/abs/2412.15213v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion</title>
      <itunes:episode>255</itunes:episode>
      <podcast:episode>255</podcast:episode>
      <itunes:title>Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8111760f-97ce-4abd-9524-243dfcfe193b</guid>
      <link>https://share.transistor.fm/s/56ed9141</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister</p>

            <p><strong>Title:</strong><br>
            Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14462v1">http://arxiv.org/abs/2412.14462v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited-data issue and to support this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Our code is available at https://github.com/KaKituken/affordance-aware-any.</p>
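
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of the dual-stream idea: a single denoiser takes a noisy RGB composite and a noisy insertion mask, mixes them in a shared trunk, and predicts the noise for both, trained with a joint epsilon-prediction loss. The tiny convolutional network and the single shared noise level are simplifications for illustration, not MADD's actual architecture or schedule.</p>

            <pre><code>
import torch
import torch.nn as nn

class DualStreamDenoiser(nn.Module):
    """Placeholder dual-stream denoiser: one stream for the RGB composite,
    one for the insertion mask, exchanging information via a shared trunk."""
    def __init__(self, dim=64):
        super().__init__()
        self.rgb_in = nn.Conv2d(3, dim, 3, padding=1)
        self.mask_in = nn.Conv2d(1, dim, 3, padding=1)
        self.trunk = nn.Conv2d(2 * dim, 2 * dim, 3, padding=1)
        self.rgb_out = nn.Conv2d(2 * dim, 3, 3, padding=1)
        self.mask_out = nn.Conv2d(2 * dim, 1, 3, padding=1)

    def forward(self, noisy_rgb, noisy_mask):
        h = torch.cat([self.rgb_in(noisy_rgb), self.mask_in(noisy_mask)], dim=1)
        h = torch.relu(self.trunk(h))
        return self.rgb_out(h), self.mask_out(h)

def joint_denoising_loss(model, rgb, mask, alpha_bar):
    """Joint epsilon-prediction objective over the RGB image and the
    insertion mask (standard DDPM forward process, shared noise level)."""
    eps_rgb, eps_mask = torch.randn_like(rgb), torch.randn_like(mask)
    noisy_rgb = alpha_bar.sqrt() * rgb + (1 - alpha_bar).sqrt() * eps_rgb
    noisy_mask = alpha_bar.sqrt() * mask + (1 - alpha_bar).sqrt() * eps_mask
    pred_rgb, pred_mask = model(noisy_rgb, noisy_mask)
    return nn.functional.mse_loss(pred_rgb, eps_rgb) + nn.functional.mse_loss(pred_mask, eps_mask)

model = DualStreamDenoiser()
loss = joint_denoising_loss(model, torch.rand(2, 3, 32, 32), torch.rand(2, 1, 32, 32), torch.tensor(0.5))
loss.backward()
</code></pre>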
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister</p>

            <p><strong>Title:</strong><br>
            Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14462v1">http://arxiv.org/abs/2412.14462v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited-data issue and to support this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Our code is available at https://github.com/KaKituken/affordance-aware-any.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:29:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/56ed9141/7dbbcb1b.mp3" length="19957992" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister</p>

            <p><strong>Title:</strong><br>
            Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14462v1">http://arxiv.org/abs/2412.14462v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited-data issue and to support this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Our code is available at https://github.com/KaKituken/affordance-aware-any.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis</title>
      <itunes:episode>254</itunes:episode>
      <podcast:episode>254</podcast:episode>
      <itunes:title>LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">455550e1-70d9-4fa7-9cd4-ca7bc61f6252</guid>
      <link>https://share.transistor.fm/s/7b7d89d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang</p>

            <p><strong>Title:</strong><br>
            LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15214v1">http://arxiv.org/abs/2412.15214v1</a></p>

            <p><strong>Abstract:</strong><br>
            The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth to each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience of 2D dragging, but also facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/</p>
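
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of how an object mask might be abstracted into a handful of control points carrying depth and instance information. K-means clustering of mask pixels stands in for whatever clustering LeviTor actually uses, and the (x, y, depth, instance id) packing is an assumption for illustration, not the paper's control-signal encoding.</p>

            <pre><code>
import numpy as np
from sklearn.cluster import KMeans

def mask_to_control_points(mask: np.ndarray, depth: float, instance_id: int, k: int = 8):
    """Abstract a binary object mask into k cluster points and attach a
    user-assigned relative depth and an instance id, yielding the kind of
    sparse control signal a video diffusion model could consume."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords).cluster_centers_
    # Each row: (x, y, relative depth, instance id)
    extras = np.full((k, 2), [depth, float(instance_id)], dtype=np.float32)
    return np.concatenate([centers, extras], axis=1)

# Toy usage: a square object mask, pushed "away from the camera" (depth 0.8).
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:30] = True
print(mask_to_control_points(mask, depth=0.8, instance_id=1).shape)  # (8, 4)
</code></pre>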
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang</p>

            <p><strong>Title:</strong><br>
            LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15214v1">http://arxiv.org/abs/2412.15214v1</a></p>

            <p><strong>Abstract:</strong><br>
            The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth to each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience of 2D dragging, but also facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:28:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7b7d89d8/c7d1c314.mp3" length="20342090" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1268</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang</p>

            <p><strong>Title:</strong><br>
            LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15214v1">http://arxiv.org/abs/2412.15214v1</a></p>

            <p><strong>Abstract:</strong><br>
            The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth to each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience of 2D dragging, but also facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation</title>
      <itunes:episode>253</itunes:episode>
      <podcast:episode>253</podcast:episode>
      <itunes:title>DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">513ecd70-4726-45cc-8cc8-483ca4cf0b42</guid>
      <link>https://share.transistor.fm/s/2be8170c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan</p>

            <p><strong>Title:</strong><br>
            DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15200v1">http://arxiv.org/abs/2412.15200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Procedural Content Generation (PCG) is powerful in creating high-quality 3D content, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sampling iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.</p>
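
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of treating PCG parameters as the denoising target with the image as the condition: noise a parameter vector, concatenate it with an image embedding and the noise level, and train a network to predict the noise. DI-PCG's actual diffusion transformer is replaced here by a toy MLP, and all dimensions are made up for illustration.</p>

            <pre><code>
import torch
import torch.nn as nn

n_params = 16      # size of the procedural generator's parameter vector (made up)
img_dim = 128      # size of the image-condition embedding (made up)

# Toy conditional denoiser; the real DI-PCG uses a lightweight diffusion transformer.
denoiser = nn.Sequential(
    nn.Linear(n_params + img_dim + 1, 256), nn.GELU(), nn.Linear(256, n_params)
)

def parameter_denoising_loss(pcg_params, image_embedding, alpha_bar):
    """Epsilon-prediction loss where PCG parameters are the denoising target
    and the observed image embedding is the condition."""
    eps = torch.randn_like(pcg_params)
    noisy = alpha_bar.sqrt() * pcg_params + (1 - alpha_bar).sqrt() * eps
    t = alpha_bar.expand(pcg_params.size(0), 1)
    pred = denoiser(torch.cat([noisy, image_embedding, t], dim=-1))
    return nn.functional.mse_loss(pred, eps)

loss = parameter_denoising_loss(torch.rand(4, n_params), torch.randn(4, img_dim), torch.tensor(0.7))
loss.backward()
</code></pre>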
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan</p>

            <p><strong>Title:</strong><br>
            DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15200v1">http://arxiv.org/abs/2412.15200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Procedural Content Generation (PCG) is powerful in creating high-quality 3D content, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sampling iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:28:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2be8170c/04eee5f1.mp3" length="22269767" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1388</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wang Zhao, Yan-Pei Cao, Jiale Xu, Yuejiang Dong, Ying Shan</p>

            <p><strong>Title:</strong><br>
            DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15200v1">http://arxiv.org/abs/2412.15200v1</a></p>

            <p><strong>Abstract:</strong><br>
            Procedural Content Generation (PCG) is powerful in creating high-quality 3D content, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sampling iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling</title>
      <itunes:episode>252</itunes:episode>
      <podcast:episode>252</podcast:episode>
      <itunes:title>AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26c416e0-f5eb-4bda-abe6-be825a87a87c</guid>
      <link>https://share.transistor.fm/s/3bfbada9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15084v1">http://arxiv.org/abs/2412.15084v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath</p>
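
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of one common reading of an rm@k metric: sample k candidate solutions per problem, let the reward model pick the highest-scoring one, and report the fraction of problems where that pick is correct. The <code>generate</code>, <code>reward</code>, and <code>is_correct</code> callables below are placeholders, not the paper's actual pipeline.</p>

            <pre><code>
from typing import Callable, List

def rm_at_k(
    problems: List[str],
    generate: Callable[[str, int], List[str]],   # returns k candidate solutions
    reward: Callable[[str, str], float],         # reward-model score for a solution
    is_correct: Callable[[str, str], bool],      # ground-truth checker
    k: int = 8,
) -> float:
    """Fraction of problems where the reward model's top pick out of k
    sampled solutions is correct (one common reading of rm@k)."""
    hits = 0
    for p in problems:
        candidates = generate(p, k)
        best = max(candidates, key=lambda c: reward(p, c))
        if is_correct(p, best):
            hits += 1
    return hits / len(problems)

# Toy usage with stub functions standing in for a policy model and a reward model.
probs = ["1+1", "2+2"]
gen = lambda p, k: [str(eval(p)) if i == 0 else "0" for i in range(k)]
rew = lambda p, c: 1.0 if c == str(eval(p)) else 0.0
chk = lambda p, c: c == str(eval(p))
print(rm_at_k(probs, gen, rew, chk, k=8))  # 1.0
</code></pre>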
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15084v1">http://arxiv.org/abs/2412.15084v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 20 Dec 2024 20:27:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3bfbada9/db1e11ae.mp3" length="23244423" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1449</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping</p>

            <p><strong>Title:</strong><br>
            AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.15084v1">http://arxiv.org/abs/2412.15084v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>No More Adam: Learning Rate Scaling at Initialization is All You Need</title>
      <itunes:episode>251</itunes:episode>
      <podcast:episode>251</podcast:episode>
      <itunes:title>No More Adam: Learning Rate Scaling at Initialization is All You Need</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4080ce9b-ed5d-4803-bb85-d55a12f1e8dc</guid>
      <link>https://share.transistor.fm/s/3c91dcff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 177 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen</p>

            <p><strong>Title:</strong><br>
            No More Adam: Learning Rate Scaling at Initialization is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11768v2">http://arxiv.org/abs/2412.11768v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI applies learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.</p>
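
            <p><strong>Code sketch:</strong><br>
            A hedged sketch of learning-rate scaling at initialization: probe one batch, compute a gradient signal-to-noise ratio per parameter tensor, and scale a base SGD-with-momentum learning rate for each group once, before training starts. The g-SNR formula below (mean absolute gradient over standard deviation) and the per-tensor grouping are illustrative assumptions, not necessarily the paper's exact definitions.</p>

            <pre><code>
import torch

def gsnr(grad: torch.Tensor, eps: float = 1e-8) -> float:
    """A simple gradient signal-to-noise ratio: mean magnitude over std
    (illustrative definition; the paper's exact g-SNR may differ)."""
    return (grad.abs().mean() / (grad.std() + eps)).item()

def sgd_sai_param_groups(model, loss_fn, batch, base_lr=0.1, eps=1e-8):
    """Build per-parameter-tensor SGD groups whose learning rates are scaled
    once, at initialization, by each tensor's g-SNR on a single probe batch."""
    loss = loss_fn(model, batch)
    loss.backward()
    groups = []
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        scale = gsnr(p.grad, eps)
        groups.append({"params": [p], "lr": base_lr * scale})
        p.grad = None  # reset; the real training run starts from step 0
    return groups

# Toy demonstration: fixed per-group LRs, then plain momentum SGD thereafter.
model = torch.nn.Linear(4, 2)
loss_fn = lambda m, b: m(b).pow(2).mean()
groups = sgd_sai_param_groups(model, loss_fn, torch.randn(8, 4))
optimizer = torch.optim.SGD(groups, momentum=0.9)
</code></pre>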
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 177 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen</p>

            <p><strong>Title:</strong><br>
            No More Adam: Learning Rate Scaling at Initialization is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11768v2">http://arxiv.org/abs/2412.11768v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI applies learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:42:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3c91dcff/ab303d00.mp3" length="21155869" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1319</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 177 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen</p>

            <p><strong>Title:</strong><br>
            No More Adam: Learning Rate Scaling at Initialization is All You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11768v2">http://arxiv.org/abs/2412.11768v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI applies learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference</title>
      <itunes:episode>250</itunes:episode>
      <podcast:episode>250</podcast:episode>
      <itunes:title>Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b26c033-868d-4a16-b82c-3796d0ec49eb</guid>
      <link>https://share.transistor.fm/s/03a6df6d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli</p>

            <p><strong>Title:</strong><br>
            Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13663v2">http://arxiv.org/abs/2412.13663v2</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli</p>

            <p><strong>Title:</strong><br>
            Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13663v2">http://arxiv.org/abs/2412.13663v2</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:41:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/03a6df6d/62575f8f.mp3" length="21110793" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1316</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli</p>

            <p><strong>Title:</strong><br>
            Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13663v2">http://arxiv.org/abs/2412.13663v2</a></p>

            <p><strong>Abstract:</strong><br>
            Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks</title>
      <itunes:episode>249</itunes:episode>
      <podcast:episode>249</podcast:episode>
      <itunes:title>TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7397a0a7-a684-43eb-a861-2879dc6e0f0d</guid>
      <link>https://share.transistor.fm/s/18fc7aa5</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14161v1">http://arxiv.org/abs/2412.14161v1</a></p>

            <p><strong>Abstract:</strong><br>
            We interact with computers on a daily basis, be it in everyday life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the performance of these LLM agents on real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14161v1">http://arxiv.org/abs/2412.14161v1</a></p>

            <p><strong>Abstract:</strong><br>
            We interact with computers on a daily basis, be it in everyday life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the performance of these LLM agents on real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:41:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18fc7aa5/c54cac03.mp3" length="23817020" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1485</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14161v1">http://arxiv.org/abs/2412.14161v1</a></p>

            <p><strong>Abstract:</strong><br>
            We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been rapid development of AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows and for economic policy aimed at understanding the effects that AI adoption may have on the labor market. To measure the performance of these LLM agents on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AniDoc: Animation Creation Made Easier</title>
      <itunes:episode>248</itunes:episode>
      <podcast:episode>248</podcast:episode>
      <itunes:title>AniDoc: Animation Creation Made Easier</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64f046c3-cfc4-49dc-9dde-fc146f21d4a2</guid>
      <link>https://share.transistor.fm/s/01a3a1da</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu</p>

            <p><strong>Title:</strong><br>
            AniDoc: Animation Creation Made Easier</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14173v1">http://arxiv.org/abs/2412.14173v1</a></p>

            <p><strong>Abstract:</strong><br>
            The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs of this process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc is a video line art colorization tool that automatically converts sketch sequences into colored animations following a reference character specification. Our model exploits correspondence matching as explicit guidance, yielding strong robustness to variations (e.g., in posture) between the reference character and each line art frame. In addition, our model can even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. Our code is available at: https://yihao-meng.github.io/AniDoc_demo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu</p>

            <p><strong>Title:</strong><br>
            AniDoc: Animation Creation Made Easier</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14173v1">http://arxiv.org/abs/2412.14173v1</a></p>

            <p><strong>Abstract:</strong><br>
            The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs of this process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc is a video line art colorization tool that automatically converts sketch sequences into colored animations following a reference character specification. Our model exploits correspondence matching as explicit guidance, yielding strong robustness to variations (e.g., in posture) between the reference character and each line art frame. In addition, our model can even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. Our code is available at: https://yihao-meng.github.io/AniDoc_demo.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:40:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/01a3a1da/e78753b7.mp3" length="21497729" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu</p>

            <p><strong>Title:</strong><br>
            AniDoc: Animation Creation Made Easier</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14173v1">http://arxiv.org/abs/2412.14173v1</a></p>

            <p><strong>Abstract:</strong><br>
            The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs of this process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc is a video line art colorization tool that automatically converts sketch sequences into colored animations following a reference character specification. Our model exploits correspondence matching as explicit guidance, yielding strong robustness to variations (e.g., in posture) between the reference character and each line art frame. In addition, our model can even automate the in-betweening process, such that users can easily create a temporally consistent animation by simply providing a character image as well as the start and end sketches. Our code is available at: https://yihao-meng.github.io/AniDoc_demo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FashionComposer: Compositional Fashion Image Generation</title>
      <itunes:episode>247</itunes:episode>
      <podcast:episode>247</podcast:episode>
      <itunes:title>FashionComposer: Compositional Fashion Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f97a693-0d25-42f4-b127-058c434fdbdc</guid>
      <link>https://share.transistor.fm/s/a00cc292</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FashionComposer: Compositional Fashion Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14168v2">http://arxiv.org/abs/2412.14168v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We construct scaled training data to enhance the model's robust compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an "asset library" and employ a reference UNet to extract appearance features. To inject the appearance features into the correct pixels in the generated result, we propose subject-binding attention. It binds the appearance features from different "assets" with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications, such as human album generation and diverse virtual try-on tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FashionComposer: Compositional Fashion Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14168v2">http://arxiv.org/abs/2412.14168v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We construct scaled training data to enhance the model's robust compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an "asset library" and employ a reference UNet to extract appearance features. To inject the appearance features into the correct pixels in the generated result, we propose subject-binding attention. It binds the appearance features from different "assets" with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications, such as human album generation and diverse virtual try-on tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:40:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a00cc292/d5c10500.mp3" length="19052266" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1187</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            FashionComposer: Compositional Fashion Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14168v2">http://arxiv.org/abs/2412.14168v2</a></p>

            <p><strong>Abstract:</strong><br>
            We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We construct scaled training data to enhance the model's robust compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an "asset library" and employ a reference UNet to extract appearance features. To inject the appearance features into the correct pixels in the generated result, we propose subject-binding attention. It binds the appearance features from different "assets" with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications, such as human album generation and diverse virtual try-on tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GUI Agents: A Survey</title>
      <itunes:episode>246</itunes:episode>
      <podcast:episode>246</podcast:episode>
      <itunes:title>GUI Agents: A Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">89652a7b-c955-4237-a4dd-2aca7a12d1d3</guid>
      <link>https://share.transistor.fm/s/16abc05e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt</p>

            <p><strong>Title:</strong><br>
            GUI Agents: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13501v1">http://arxiv.org/abs/2412.13501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt</p>

            <p><strong>Title:</strong><br>
            GUI Agents: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13501v1">http://arxiv.org/abs/2412.13501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:40:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16abc05e/7b6a7999.mp3" length="20233802" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.AI, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt</p>

            <p><strong>Title:</strong><br>
            GUI Agents: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13501v1">http://arxiv.org/abs/2412.13501v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning</title>
      <itunes:episode>245</itunes:episode>
      <podcast:episode>245</podcast:episode>
      <itunes:title>Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6aeaec0-06b9-4589-b668-265dbb6f6071</guid>
      <link>https://share.transistor.fm/s/b963cf43</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov</p>

            <p><strong>Title:</strong><br>
            Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12953v1">http://arxiv.org/abs/2412.12953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models become larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov</p>

            <p><strong>Title:</strong><br>
            Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12953v1">http://arxiv.org/abs/2412.12953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models become larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:39:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b963cf43/d6022142.mp3" length="21854724" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov</p>

            <p><strong>Title:</strong><br>
            Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12953v1">http://arxiv.org/abs/2412.12953v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models become larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation</title>
      <itunes:episode>244</itunes:episode>
      <podcast:episode>244</podcast:episode>
      <itunes:title>Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aed92f8a-46c5-441b-b1a7-3d35bf8f044e</guid>
      <link>https://share.transistor.fm/s/143f362d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14015v1">http://arxiv.org/abs/2412.14015v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address the training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo GT depth generation for real data. Our approach sets new state-of-the-art results on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14015v1">http://arxiv.org/abs/2412.14015v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address the training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo GT depth generation for real data. Our approach sets new state-of-the-art results on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:39:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/143f362d/a440c1e2.mp3" length="19915372" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1241</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang</p>

            <p><strong>Title:</strong><br>
            Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14015v1">http://arxiv.org/abs/2412.14015v1</a></p>

            <p><strong>Abstract:</strong><br>
            Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address the training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo GT depth generation for real data. Our approach sets new state-of-the-art results on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</title>
      <itunes:episode>243</itunes:episode>
      <podcast:episode>243</podcast:episode>
      <itunes:title>Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">41a75b97-b82d-4b42-89d9-a0b1d79c4f9d</guid>
      <link>https://share.transistor.fm/s/8d44f56b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14171v1">http://arxiv.org/abs/2412.14171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14171v1">http://arxiv.org/abs/2412.14171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 19 Dec 2024 20:39:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8d44f56b/5efa0c3f.mp3" length="20087584" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1252</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, Saining Xie</p>

            <p><strong>Title:</strong><br>
            Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.14171v1">http://arxiv.org/abs/2412.14171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive - though subhuman - visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Are Your LLMs Capable of Stable Reasoning?</title>
      <itunes:episode>242</itunes:episode>
      <podcast:episode>242</podcast:episode>
      <itunes:title>Are Your LLMs Capable of Stable Reasoning?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7da84157-e214-4465-ad86-155ca62f4bcf</guid>
      <link>https://share.transistor.fm/s/24408f0a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Are Your LLMs Capable of Stable Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13147v2">http://arxiv.org/abs/2412.13147v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Are Your LLMs Capable of Stable Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13147v2">http://arxiv.org/abs/2412.13147v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:28:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/24408f0a/21d56b39.mp3" length="23277403" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1451</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 61 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen</p>

            <p><strong>Title:</strong><br>
            Are Your LLMs Capable of Stable Reasoning?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13147v2">http://arxiv.org/abs/2412.13147v2</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancement of Large Language Models (LLMs) has demonstrated remarkable progress in complex reasoning tasks. However, a significant discrepancy persists between benchmark performances and real-world applications. We identify this gap as primarily stemming from current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, particularly in complex reasoning tasks where both accuracy and consistency are crucial. This work makes two key contributions. First, we introduce G-Pass@k, a novel evaluation metric that provides a continuous assessment of model performance across multiple sampling attempts, quantifying both the model's peak performance potential and its stability. Second, we present LiveMathBench, a dynamic benchmark comprising challenging, contemporary mathematical problems designed to minimize data leakage risks during evaluation. Through extensive experiments using G-Pass@k on state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights into both their maximum capabilities and operational consistency. Our findings reveal substantial room for improvement in LLMs' "realistic" reasoning capabilities, highlighting the need for more robust evaluation methods. The benchmark and detailed results are available at: https://github.com/open-compass/GPassK.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models</title>
      <itunes:episode>241</itunes:episode>
      <podcast:episode>241</podcast:episode>
      <itunes:title>Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">697f63d0-1605-4fc4-b1b1-b0bc64e26cba</guid>
      <link>https://share.transistor.fm/s/5271d2e7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            YiFan Zhang, Shanglin Lei, Runqi Qiao, Zhuoma GongQue, Xiaoshuai Song, Guanting Dong, Qiuna Tan, Zhe Wei, Peiqing Yang, Ye Tian, Yadong Xue, Xiaofei Wang, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12606v1">http://arxiv.org/abs/2412.12606v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. On the MDI-Benchmark, even a strong model like GPT-4o achieves only 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs. The MDI-Benchmark data and evaluation code are available at https://mdi-benchmark.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            YiFan Zhang, Shanglin Lei, Runqi Qiao, Zhuoma GongQue, Xiaoshuai Song, Guanting Dong, Qiuna Tan, Zhe Wei, Peiqing Yang, Ye Tian, Yadong Xue, Xiaofei Wang, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12606v1">http://arxiv.org/abs/2412.12606v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. On the MDI-Benchmark, even a strong model like GPT-4o achieves only 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs. The MDI-Benchmark data and evaluation code are available at https://mdi-benchmark.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:27:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5271d2e7/afd0bccb.mp3" length="21717631" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1354</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.AI, cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            YiFan Zhang, Shanglin Lei, Runqi Qiao, Zhuoma GongQue, Xiaoshuai Song, Guanting Dong, Qiuna Tan, Zhe Wei, Peiqing Yang, Ye Tian, Yadong Xue, Xiaofei Wang, Honggang Zhang</p>

            <p><strong>Title:</strong><br>
            Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12606v1">http://arxiv.org/abs/2412.12606v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model's understanding of the image, and complex questions to evaluate the model's ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. On the MDI-Benchmark, even a strong model like GPT-4o achieves only 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs. The MDI-Benchmark data and evaluation code are available at https://mdi-benchmark.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</title>
      <itunes:episode>240</itunes:episode>
      <podcast:episode>240</podcast:episode>
      <itunes:title>OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">987aaadd-c9df-4382-bebe-b2e092cd36e1</guid>
      <link>https://share.transistor.fm/s/fa6c0739</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13018v1">http://arxiv.org/abs/2412.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce OmniEval, an omnidirectional and automatic RAG benchmark for the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13018v1">http://arxiv.org/abs/2412.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:27:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fa6c0739/30ba84b6.mp3" length="22381343" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13018v1">http://arxiv.org/abs/2412.13018v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances; (3) a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4) robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets and highlights the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at https://github.com/RUC-NLPIR/OmniEval.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Compressed Chain of Thought: Efficient Reasoning Through Dense Representations</title>
      <itunes:episode>239</itunes:episode>
      <podcast:episode>239</podcast:episode>
      <itunes:title>Compressed Chain of Thought: Efficient Reasoning Through Dense Representations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">184aa507-60b1-4316-91a5-0276c01705b6</guid>
      <link>https://share.transistor.fm/s/d131754d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Compressed Chain of Thought: Efficient Reasoning Through Dense Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13171v1">http://arxiv.org/abs/2412.13171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Compressed Chain of Thought: Efficient Reasoning Through Dense Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13171v1">http://arxiv.org/abs/2412.13171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:27:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d131754d/9935bb30.mp3" length="22214569" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1385</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jeffrey Cheng, Benjamin Van Durme</p>

            <p><strong>Title:</strong><br>
            Compressed Chain of Thought: Efficient Reasoning Through Dense Representations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13171v1">http://arxiv.org/abs/2412.13171v1</a></p>

            <p><strong>Abstract:</strong><br>
            Chain-of-thought (CoT) decoding enables language models to improve reasoning performance at the cost of high generation latency in decoding. Recent proposals have explored variants of contemplation tokens, a term we introduce that refers to special tokens used during inference to allow for extra computation. Prior work has considered fixed-length sequences drawn from a discrete set of embeddings as contemplation tokens. Here we propose Compressed Chain-of-Thought (CCoT), a framework to generate contentful and continuous contemplation tokens of variable sequence length. The generated contemplation tokens are compressed representations of explicit reasoning chains, and our method can be applied to off-the-shelf decoder language models. Through experiments, we illustrate how CCoT enables additional reasoning over dense contentful representations to achieve corresponding improvements in accuracy. Moreover, the reasoning improvements can be adaptively modified on demand by controlling the number of contemplation tokens generated.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers</title>
      <itunes:episode>238</itunes:episode>
      <podcast:episode>238</podcast:episode>
      <itunes:title>Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ce904605-bfef-47ea-9e08-b698d2a61d49</guid>
      <link>https://share.transistor.fm/s/d296585d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal</p>

            <p><strong>Title:</strong><br>
            Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12276v2">http://arxiv.org/abs/2412.12276v2</a></p>

            <p><strong>Abstract:</strong><br>
            Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal</p>

            <p><strong>Title:</strong><br>
            Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12276v2">http://arxiv.org/abs/2412.12276v2</a></p>

            <p><strong>Abstract:</strong><br>
            Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:26:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d296585d/10c2f3cd.mp3" length="22013976" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal</p>

            <p><strong>Title:</strong><br>
            Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12276v2">http://arxiv.org/abs/2412.12276v2</a></p>

            <p><strong>Abstract:</strong><br>
            Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to and predictive of ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration</title>
      <itunes:episode>237</itunes:episode>
      <podcast:episode>237</podcast:episode>
      <itunes:title>Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">056d7293-3458-48cb-85d2-d242c0841676</guid>
      <link>https://share.transistor.fm/s/aa618a4f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mark Endo, Xiaohan Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13180v1">http://arxiv.org/abs/2412.13180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER has more than 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mark Endo, Xiaohan Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13180v1">http://arxiv.org/abs/2412.13180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER has more than 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:26:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aa618a4f/9146e99c.mp3" length="19968470" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1244</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Mark Endo, Xiaohan Wang, Serena Yeung-Levy</p>

            <p><strong>Title:</strong><br>
            Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13180v1">http://arxiv.org/abs/2412.13180v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER has more than 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents</title>
      <itunes:episode>236</itunes:episode>
      <podcast:episode>236</podcast:episode>
      <itunes:title>Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5aa23d6d-3939-4128-97be-2e7377f8e815</guid>
      <link>https://share.transistor.fm/s/fb3a1b6a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li</p>

            <p><strong>Title:</strong><br>
            Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13194v1">http://arxiv.org/abs/2412.13194v1</a></p>

            <p><strong>Abstract:</strong><br>
            The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice, using context information from the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with SOTA performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li</p>

            <p><strong>Title:</strong><br>
            Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13194v1">http://arxiv.org/abs/2412.13194v1</a></p>

            <p><strong>Abstract:</strong><br>
            The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice, using context information from the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with SOTA performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:26:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fb3a1b6a/fff313bc.mp3" length="22989063" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li</p>

            <p><strong>Title:</strong><br>
            Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13194v1">http://arxiv.org/abs/2412.13194v1</a></p>

            <p><strong>Abstract:</strong><br>
            The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice, using context information from the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with SOTA performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation</title>
      <itunes:episode>235</itunes:episode>
      <podcast:episode>235</podcast:episode>
      <itunes:title>VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f040c958-f5df-491b-99e3-cc7dba3d7e4a</guid>
      <link>https://share.transistor.fm/s/511389c7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha</p>

            <p><strong>Title:</strong><br>
            VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10704v1">http://arxiv.org/abs/2412.10704v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha</p>

            <p><strong>Title:</strong><br>
            VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10704v1">http://arxiv.org/abs/2412.10704v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:25:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/511389c7/3a3480eb.mp3" length="22326605" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1392</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha</p>

            <p><strong>Title:</strong><br>
            VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10704v1">http://arxiv.org/abs/2412.10704v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner</title>
      <itunes:episode>234</itunes:episode>
      <podcast:episode>234</podcast:episode>
      <itunes:title>SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">45fde643-0476-4c4c-a3a1-64c6830fb752</guid>
      <link>https://share.transistor.fm/s/128168b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun</p>

            <p><strong>Title:</strong><br>
            SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10533v1">http://arxiv.org/abs/2412.10533v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun</p>

            <p><strong>Title:</strong><br>
            SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10533v1">http://arxiv.org/abs/2412.10533v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:25:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/128168b9/8df4e845.mp3" length="19697186" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1227</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun</p>

            <p><strong>Title:</strong><br>
            SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10533v1">http://arxiv.org/abs/2412.10533v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion</title>
      <itunes:episode>233</itunes:episode>
      <podcast:episode>233</podcast:episode>
      <itunes:title>Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c66ae04-c20e-4339-a352-5dcccc0c1c70</guid>
      <link>https://share.transistor.fm/s/a9058853</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov</p>

            <p><strong>Title:</strong><br>
            Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13389v1">http://arxiv.org/abs/2412.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth, rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov</p>

            <p><strong>Title:</strong><br>
            Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13389v1">http://arxiv.org/abs/2412.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth, rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 18 Dec 2024 20:24:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9058853/350ea044.mp3" length="19788309" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1233</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 2 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov</p>

            <p><strong>Title:</strong><br>
            Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.13389v1">http://arxiv.org/abs/2412.13389v1</a></p>

            <p><strong>Abstract:</strong><br>
            Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth, rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Byte Latent Transformer: Patches Scale Better Than Tokens</title>
      <itunes:episode>232</itunes:episode>
      <podcast:episode>232</podcast:episode>
      <itunes:title>Byte Latent Transformer: Patches Scale Better Than Tokens</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e17b79ad-d785-4c73-9e7d-c02520141b3b</guid>
      <link>https://share.transistor.fm/s/431b8e74</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer</p>

            <p><strong>Title:</strong><br>
            Byte Latent Transformer: Patches Scale Better Than Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09871v1">http://arxiv.org/abs/2412.09871v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer</p>

            <p><strong>Title:</strong><br>
            Byte Latent Transformer: Patches Scale Better Than Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09871v1">http://arxiv.org/abs/2412.09871v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:24:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/431b8e74/33f33e3f.mp3" length="24189405" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1508</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 39 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer</p>

            <p><strong>Title:</strong><br>
            Byte Latent Transformer: Patches Scale Better Than Tokens</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09871v1">http://arxiv.org/abs/2412.09871v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation</title>
      <itunes:episode>231</itunes:episode>
      <podcast:episode>231</podcast:episode>
      <itunes:title>RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e1a1655e-d551-45ec-9b13-30704324cfda</guid>
      <link>https://share.transistor.fm/s/5658771b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11919v1">http://arxiv.org/abs/2412.11919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose <strong>RetroLLM</strong>, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at <a href="https://github.com/sunnynexus/RetroLLM">https://github.com/sunnynexus/RetroLLM</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11919v1">http://arxiv.org/abs/2412.11919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose <strong>RetroLLM</strong>, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at <a href="https://github.com/sunnynexus/RetroLLM">https://github.com/sunnynexus/RetroLLM</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:23:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5658771b/e84f8a67.mp3" length="20954856" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1306</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou</p>

            <p><strong>Title:</strong><br>
            RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11919v1">http://arxiv.org/abs/2412.11919v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose <strong>RetroLLM</strong>, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at <a href="https://github.com/sunnynexus/RetroLLM">https://github.com/sunnynexus/RetroLLM</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models</title>
      <itunes:episode>230</itunes:episode>
      <podcast:episode>230</podcast:episode>
      <itunes:title>Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d7af6177-339b-47e4-b4ed-1026c0a3d2f5</guid>
      <link>https://share.transistor.fm/s/13b50a76</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09645v2">http://arxiv.org/abs/2412.09645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09645v2">http://arxiv.org/abs/2412.09645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:23:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13b50a76/a6cf4d9f.mp3" length="20372219" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09645v2">http://arxiv.org/abs/2412.09645v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BrushEdit: All-In-One Image Inpainting and Editing</title>
      <itunes:episode>229</itunes:episode>
      <podcast:episode>229</podcast:episode>
      <itunes:title>BrushEdit: All-In-One Image Inpainting and Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c4d5d70a-83e0-4019-a2f2-bcae292c9dcc</guid>
      <link>https://share.transistor.fm/s/81ccede7</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Yuexian Zou, Qiang Xu</p>

            <p><strong>Title:</strong><br>
            BrushEdit: All-In-One Image Inpainting and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10316v2">http://arxiv.org/abs/2412.10316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Yuexian Zou, Qiang Xu</p>

            <p><strong>Title:</strong><br>
            BrushEdit: All-In-One Image Inpainting and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10316v2">http://arxiv.org/abs/2412.10316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:23:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81ccede7/49f73e67.mp3" length="26743128" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1668</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Yuexian Zou, Qiang Xu</p>

            <p><strong>Title:</strong><br>
            BrushEdit: All-In-One Image Inpainting and Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10316v2">http://arxiv.org/abs/2412.10316v2</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ColorFlow: Retrieval-Augmented Image Sequence Colorization</title>
      <itunes:episode>228</itunes:episode>
      <podcast:episode>228</podcast:episode>
      <itunes:title>ColorFlow: Retrieval-Augmented Image Sequence Colorization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d49de961-4e48-4c18-8a81-d5a166d6f673</guid>
      <link>https://share.transistor.fm/s/4f4a842b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ColorFlow: Retrieval-Augmented Image Sequence Colorization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11815v1">http://arxiv.org/abs/2412.11815v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application. To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: https://zhuang2002.github.io/ColorFlow/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ColorFlow: Retrieval-Augmented Image Sequence Colorization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11815v1">http://arxiv.org/abs/2412.11815v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application. To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: https://zhuang2002.github.io/ColorFlow/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:22:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4f4a842b/5fdbd02e.mp3" length="21693772" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1352</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan</p>

            <p><strong>Title:</strong><br>
            ColorFlow: Retrieval-Augmented Image Sequence Colorization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11815v1">http://arxiv.org/abs/2412.11815v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial application. To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: https://zhuang2002.github.io/ColorFlow/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Smaller Language Models Are Better Instruction Evolvers</title>
      <itunes:episode>227</itunes:episode>
      <podcast:episode>227</podcast:episode>
      <itunes:title>Smaller Language Models Are Better Instruction Evolvers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">645d834a-3ef7-4d23-a9e6-34ba2ee35070</guid>
      <link>https://share.transistor.fm/s/677297e0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su</p>

            <p><strong>Title:</strong><br>
            Smaller Language Models Are Better Instruction Evolvers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11231v1">http://arxiv.org/abs/2412.11231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: <a href="https://github.com/HypherX/Evolution-Analysis">https://github.com/HypherX/Evolution-Analysis</a></p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su</p>

            <p><strong>Title:</strong><br>
            Smaller Language Models Are Better Instruction Evolvers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11231v1">http://arxiv.org/abs/2412.11231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: <a href="https://github.com/HypherX/Evolution-Analysis">https://github.com/HypherX/Evolution-Analysis</a></p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:22:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/677297e0/30799ade.mp3" length="22404299" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1397</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su</p>

            <p><strong>Title:</strong><br>
            Smaller Language Models Are Better Instruction Evolvers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11231v1">http://arxiv.org/abs/2412.11231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: <a href="https://github.com/HypherX/Evolution-Analysis">https://github.com/HypherX/Evolution-Analysis</a></p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Causal Diffusion Transformers for Generative Modeling</title>
      <itunes:episode>226</itunes:episode>
      <podcast:episode>226</podcast:episode>
      <itunes:title>Causal Diffusion Transformers for Generative Modeling</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">41fdf1d2-600c-467e-839e-e2b3bf8e7078</guid>
      <link>https://share.transistor.fm/s/1179487c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Causal Diffusion Transformers for Generative Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12095v2">http://arxiv.org/abs/2412.12095v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Causal Diffusion Transformers for Generative Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12095v2">http://arxiv.org/abs/2412.12095v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:21:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1179487c/34ff4942.mp3" length="22893727" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1427</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan</p>

            <p><strong>Title:</strong><br>
            Causal Diffusion Transformers for Generative Modeling</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12095v2">http://arxiv.org/abs/2412.12095v2</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models</title>
      <itunes:episode>225</itunes:episode>
      <podcast:episode>225</podcast:episode>
      <itunes:title>SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">81978793-b818-4e68-9a71-7296a97a76a3</guid>
      <link>https://share.transistor.fm/s/68c919ab</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11605v1">http://arxiv.org/abs/2412.11605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11605v1">http://arxiv.org/abs/2412.11605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:21:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/68c919ab/23654762.mp3" length="22213756" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1385</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang</p>

            <p><strong>Title:</strong><br>
            SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11605v1">http://arxiv.org/abs/2412.11605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations</title>
      <itunes:episode>224</itunes:episode>
      <podcast:episode>224</podcast:episode>
      <itunes:title>IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">41b30b40-a3f6-4187-8df9-129c7b5d5e75</guid>
      <link>https://share.transistor.fm/s/8498ab9e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12083v1">http://arxiv.org/abs/2412.12083v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12083v1">http://arxiv.org/abs/2412.12083v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:21:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8498ab9e/ffd6539a.mp3" length="19729389" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1229</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.12083v1">http://arxiv.org/abs/2412.12083v1</a></p>

            <p><strong>Abstract:</strong><br>
            Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs</title>
      <itunes:episode>223</itunes:episode>
      <podcast:episode>223</podcast:episode>
      <itunes:title>GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f26522ba-d636-46d2-8c38-9a184ab9f53a</guid>
      <link>https://share.transistor.fm/s/727e76d8</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinli Xu, Wenhang Ge, Dicong Qiu, ZhiFei Chen, Dongyu Yan, Zhuoyun Liu, Haoyu Zhao, Hanfeng Zhao, Shunsi Zhang, Junwei Liang, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11258v1">http://arxiv.org/abs/2412.11258v1</a></p>

            <p><strong>Abstract:</strong><br>
            Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data. Online demo, code, more cases and annotated datasets are available at <a href="https://Gaussian-Property.github.io">https://Gaussian-Property.github.io</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinli Xu, Wenhang Ge, Dicong Qiu, ZhiFei Chen, Dongyu Yan, Zhuoyun Liu, Haoyu Zhao, Hanfeng Zhao, Shunsi Zhang, Junwei Liang, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11258v1">http://arxiv.org/abs/2412.11258v1</a></p>

            <p><strong>Abstract:</strong><br>
            Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data. Online demo, code, more cases and annotated datasets are available at <a href="https://Gaussian-Property.github.io">https://Gaussian-Property.github.io</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 17 Dec 2024 20:20:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/727e76d8/881e3d29.mp3" length="20463735" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.RO, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xinli Xu, Wenhang Ge, Dicong Qiu, ZhiFei Chen, Dongyu Yan, Zhuoyun Liu, Haoyu Zhao, Hanfeng Zhao, Shunsi Zhang, Junwei Liang, Ying-Cong Chen</p>

            <p><strong>Title:</strong><br>
            GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.11258v1">http://arxiv.org/abs/2412.11258v1</a></p>

            <p><strong>Abstract:</strong><br>
            Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data. Online demo, code, more cases and annotated datasets are available at <a href="https://Gaussian-Property.github.io">https://Gaussian-Property.github.io</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Apollo: An Exploration of Video Understanding in Large Multimodal Models</title>
      <itunes:episode>222</itunes:episode>
      <podcast:episode>222</podcast:episode>
      <itunes:title>Apollo: An Exploration of Video Understanding in Large Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">387e850f-0be2-4629-8195-48ffac3cdea8</guid>
      <link>https://share.transistor.fm/s/708aa61f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia</p>

            <p><strong>Title:</strong><br>
            Apollo: An Exploration of Video Understanding in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10360v1">http://arxiv.org/abs/2412.10360v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs.   We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and identified which vision encoders are best for video representation.   Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU and 63.3 on Video-MME.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia</p>

            <p><strong>Title:</strong><br>
            Apollo: An Exploration of Video Understanding in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10360v1">http://arxiv.org/abs/2412.10360v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs.   We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and identified which vision encoders are best for video representation.   Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU and 63.3 on Video-MME.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:17:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/708aa61f/b28dedb8.mp3" length="24043134" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1499</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 91 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia</p>

            <p><strong>Title:</strong><br>
            Apollo: An Exploration of Video Understanding in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10360v1">http://arxiv.org/abs/2412.10360v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs.   We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and identified which vision encoders are best for video representation.   Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU and 63.3 on Video-MME.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GenEx: Generating an Explorable World</title>
      <itunes:episode>221</itunes:episode>
      <podcast:episode>221</podcast:episode>
      <itunes:title>GenEx: Generating an Explorable World</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d47ced6e-b5ce-48c5-91e4-0094830cfaad</guid>
      <link>https://share.transistor.fm/s/432f6bd3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            GenEx: Generating an Explorable World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09624v1">http://arxiv.org/abs/2412.09624v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            GenEx: Generating an Explorable World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09624v1">http://arxiv.org/abs/2412.09624v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:16:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/432f6bd3/e274e5f5.mp3" length="20664317" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 65 | cs.CV, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            GenEx: Generating an Explorable World</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09624v1">http://arxiv.org/abs/2412.09624v1</a></p>

            <p><strong>Abstract:</strong><br>
            Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding</title>
      <itunes:episode>220</itunes:episode>
      <podcast:episode>220</podcast:episode>
      <itunes:title>SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5fec23dc-3625-497e-9f89-4cd17e2cc72a</guid>
      <link>https://share.transistor.fm/s/cd293603</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09604v1">http://arxiv.org/abs/2412.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09604v1">http://arxiv.org/abs/2412.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:16:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cd293603/43ab51c8.mp3" length="24293943" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1515</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 29 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09604v1">http://arxiv.org/abs/2412.09604v1</a></p>

            <p><strong>Abstract:</strong><br>
            The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities</title>
      <itunes:episode>219</itunes:episode>
      <podcast:episode>219</podcast:episode>
      <itunes:title>BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a343f583-ef56-4e61-b0fd-b784560f9d05</guid>
      <link>https://share.transistor.fm/s/890b888d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07769v1">http://arxiv.org/abs/2412.07769v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions for both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o-based medical LMM benchmark, named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models in medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations with over 9% improvement in English and over 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical Visual Question Answering, Report Generation, and Report Summarization tasks. The project page, including source code and the trained model, is available at https://github.com/mbzuai-oryx/BiMediX2.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07769v1">http://arxiv.org/abs/2412.07769v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions for both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o-based medical LMM benchmark, named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models in medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations with over 9% improvement in English and over 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical Visual Question Answering, Report Generation, and Report Summarization tasks. The project page, including source code and the trained model, is available at https://github.com/mbzuai-oryx/BiMediX2.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:16:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/890b888d/3ff94d08.mp3" length="17118795" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1066</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 24 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal</p>

            <p><strong>Title:</strong><br>
            BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07769v1">http://arxiv.org/abs/2412.07769v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions for both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o-based medical LMM benchmark, named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models in medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations with over 9% improvement in English and over 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical Visual Question Answering, Report Generation, and Report Summarization tasks. The project page, including source code and the trained model, is available at https://github.com/mbzuai-oryx/BiMediX2.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Action Models: From Inception to Implementation</title>
      <itunes:episode>218</itunes:episode>
      <podcast:episode>218</podcast:episode>
      <itunes:title>Large Action Models: From Inception to Implementation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26f356ce-f77e-4609-a085-e2bad775cebd</guid>
      <link>https://share.transistor.fm/s/58cda121</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Large Action Models: From Inception to Implementation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10047v1">http://arxiv.org/abs/2412.10047v1</a></p>

            <p><strong>Abstract:</strong><br>
            As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence.   In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications.   The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Large Action Models: From Inception to Implementation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10047v1">http://arxiv.org/abs/2412.10047v1</a></p>

            <p><strong>Abstract:</strong><br>
            As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence.   In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications.   The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:15:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/58cda121/0ace4bdc.mp3" length="21419585" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 23 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Large Action Models: From Inception to Implementation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.10047v1">http://arxiv.org/abs/2412.10047v1</a></p>

            <p><strong>Abstract:</strong><br>
            As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence.   In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications.   The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption</title>
      <itunes:episode>217</itunes:episode>
      <podcast:episode>217</podcast:episode>
      <itunes:title>InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">992df247-7031-48f8-9cbc-1fc6dd32a8ca</guid>
      <link>https://share.transistor.fm/s/54f309e6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09283v1">http://arxiv.org/abs/2412.09283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level, fine-grained video captioning for the first time. Based on this scheme, we design a cluster of auxiliary models to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09283v1">http://arxiv.org/abs/2412.09283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level, fine-grained video captioning for the first time. Based on this scheme, we design a cluster of auxiliary models to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:15:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/54f309e6/5a3727a7.mp3" length="20240554" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai</p>

            <p><strong>Title:</strong><br>
            InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09283v1">http://arxiv.org/abs/2412.09283v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level, fine-grained video captioning for the first time. Based on this scheme, we design a cluster of auxiliary models to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion</title>
      <itunes:episode>216</itunes:episode>
      <podcast:episode>216</podcast:episode>
      <itunes:title>FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cff5a400-41d4-453a-87dc-11c8d59d7d03</guid>
      <link>https://share.transistor.fm/s/1343229f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09626v1">http://arxiv.org/abs/2412.09626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exploit the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns derived from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09626v1">http://arxiv.org/abs/2412.09626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exploit the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns derived from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:14:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1343229f/f027afe6.mp3" length="20729567" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1292</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09626v1">http://arxiv.org/abs/2412.09626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computational resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns derived from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with the previous best-performing method, FreeScale unlocks the generation of 8k-resolution images for the first time.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation</title>
      <itunes:episode>215</itunes:episode>
      <podcast:episode>215</podcast:episode>
      <itunes:title>ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8bd19410-4b5e-46c2-8861-1e265a34c7b9</guid>
      <link>https://share.transistor.fm/s/92ec0112</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08645v1">http://arxiv.org/abs/2412.08645v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08645v1">http://arxiv.org/abs/2412.08645v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:14:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/92ec0112/76e25492.mp3" length="20882118" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen</p>

            <p><strong>Title:</strong><br>
            ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08645v1">http://arxiv.org/abs/2412.08645v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is simply too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This powerful paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using single or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing</title>
      <itunes:episode>214</itunes:episode>
      <podcast:episode>214</podcast:episode>
      <itunes:title>FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e05c4869-58c4-49e6-ae38-d2497e887425</guid>
      <link>https://share.transistor.fm/s/7adbd78b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, Fan Tang</p>

            <p><strong>Title:</strong><br>
            FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07517v1">http://arxiv.org/abs/2412.07517v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, their fast inversion, which transforms images back to structured noise for recovery and subsequent editing, remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling generative capacity of ReFlow-based models (such as FLUX) while extending their capabilities to accurate inversion and editing in 8 steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a 3x runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode. The code is available at <a href="https://github.com/HolmesShuan/FireFlow">https://github.com/HolmesShuan/FireFlow</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, Fan Tang</p>

            <p><strong>Title:</strong><br>
            FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07517v1">http://arxiv.org/abs/2412.07517v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, their fast inversion, which transforms images back to structured noise for recovery and subsequent editing, remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling generative capacity of ReFlow-based models (such as FLUX) while extending their capabilities to accurate inversion and editing in 8 steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a 3x runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode. The code is available at <a href="https://github.com/HolmesShuan/FireFlow">https://github.com/HolmesShuan/FireFlow</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:14:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7adbd78b/e9869d0d.mp3" length="20968623" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, Fan Tang</p>

            <p><strong>Title:</strong><br>
            FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07517v1">http://arxiv.org/abs/2412.07517v1</a></p>

            <p><strong>Abstract:</strong><br>
            Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, their fast inversion, which transforms images back to structured noise for recovery and subsequent editing, remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling generative capacity of ReFlow-based models (such as FLUX) while extending their capabilities to accurate inversion and editing in 8 steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a 3x runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode. The code is available at <a href="https://github.com/HolmesShuan/FireFlow">https://github.com/HolmesShuan/FireFlow</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers</title>
      <itunes:episode>213</itunes:episode>
      <podcast:episode>213</podcast:episode>
      <itunes:title>FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">55047cb7-a9e9-43ee-9771-1901fa0efba1</guid>
      <link>https://share.transistor.fm/s/669d3013</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09611v1">http://arxiv.org/abs/2412.09611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation makes it difficult to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method that leverages a representation space able to control the semantics of images generated by rectified flow transformers, such as Flux. Building on the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach with strong disentanglement capabilities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09611v1">http://arxiv.org/abs/2412.09611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation makes it difficult to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method that leverages a representation space able to control the semantics of images generated by rectified flow transformers, such as Flux. Building on the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach with strong disentanglement capabilities.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 16 Dec 2024 20:13:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/669d3013/f3dda36d.mp3" length="18103097" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1128</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yusuf Dalva, Kavana Venkatesh, Pinar Yanardag</p>

            <p><strong>Title:</strong><br>
            FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09611v1">http://arxiv.org/abs/2412.09611v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rectified flow models have emerged as a dominant approach in image generation, showcasing impressive capabilities in high-quality image synthesis. However, despite their effectiveness in visual generation, rectified flow models often struggle with disentangled editing of images. This limitation makes it difficult to perform precise, attribute-specific modifications without affecting unrelated aspects of the image. In this paper, we introduce FluxSpace, a domain-agnostic image editing method that leverages a representation space able to control the semantics of images generated by rectified flow transformers, such as Flux. Building on the representations learned by the transformer blocks within rectified flow models, we propose a set of semantically interpretable representations that enable a wide range of image editing tasks, from fine-grained image editing to artistic creation. This work offers a scalable and effective image editing approach with strong disentanglement capabilities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Phi-4 Technical Report</title>
      <itunes:episode>212</itunes:episode>
      <podcast:episode>212</podcast:episode>
      <itunes:title>Phi-4 Technical Report</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b4701aae-dcea-47c9-9f37-995060bf7f8f</guid>
      <link>https://share.transistor.fm/s/b411dc93</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang</p>

            <p><strong>Title:</strong><br>
            Phi-4 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08905v1">http://arxiv.org/abs/2412.08905v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang</p>

            <p><strong>Title:</strong><br>
            Phi-4 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08905v1">http://arxiv.org/abs/2412.08905v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:06:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b411dc93/d513eb0f.mp3" length="21371489" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1332</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 40 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang</p>

            <p><strong>Title:</strong><br>
            Phi-4 Technical Report</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08905v1">http://arxiv.org/abs/2412.08905v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions</title>
      <itunes:episode>211</itunes:episode>
      <podcast:episode>211</podcast:episode>
      <itunes:title>Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7e2314f0-b893-43e7-9e0e-b64bd88ec981</guid>
      <link>https://share.transistor.fm/s/d2c813b0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08737v1">http://arxiv.org/abs/2412.08737v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08737v1">http://arxiv.org/abs/2412.08737v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:06:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d2c813b0/6fd9f02b.mp3" length="23541179" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jiarui Zhang, Ollie Liu, Tianyu Yu, Jinyi Hu, Willie Neiswanger</p>

            <p><strong>Title:</strong><br>
            Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08737v1">http://arxiv.org/abs/2412.08737v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP) -- particularly the ability to accurately describe the geometric details of an image. This capability is crucial for applications in areas such as robotics, medical image analysis, and manufacturing. In this paper, we first introduce Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. Using this benchmark, we demonstrate the limitations of leading MLLMs, and then conduct a comprehensive empirical study to explore strategies for improving their performance on geometric tasks. Our findings highlight the benefits of certain model architectures, training techniques, and data strategies, including the use of high-fidelity synthetic data and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks which they fail to learn from scratch. Leveraging these insights, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although purely trained on synthetic multimodal data, Euclid shows strong generalization ability to novel geometry shapes. For instance, Euclid outperforms the best closed-source model, Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and 10.65% on average across all tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Latent Language Modeling with Next-Token Diffusion</title>
      <itunes:episode>210</itunes:episode>
      <podcast:episode>210</podcast:episode>
      <itunes:title>Multimodal Latent Language Modeling with Next-Token Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">505256f2-77e6-454a-9f92-a458c1987420</guid>
      <link>https://share.transistor.fm/s/52539235</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Multimodal Latent Language Modeling with Next-Token Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08635v1">http://arxiv.org/abs/2412.08635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop σ-VAE to address the challenge of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models when scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Multimodal Latent Language Modeling with Next-Token Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08635v1">http://arxiv.org/abs/2412.08635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop σ-VAE to address the challenge of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models when scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:05:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52539235/082f0148.mp3" length="21737660" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1355</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei</p>

            <p><strong>Title:</strong><br>
            Multimodal Latent Language Modeling with Next-Token Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08635v1">http://arxiv.org/abs/2412.08635v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop σ-VAE to address the challenge of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models when scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM</title>
      <itunes:episode>209</itunes:episode>
      <podcast:episode>209</podcast:episode>
      <itunes:title>EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b19f28d3-5015-456f-9ab7-452242978a57</guid>
      <link>https://share.transistor.fm/s/924a3d9f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09618v1">http://arxiv.org/abs/2412.09618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant achievements have been made in the personalization of diffusion models. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot model interactions among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09618v1">http://arxiv.org/abs/2412.09618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant achievements have been made in the personalization of diffusion models. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot model interactions among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:05:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/924a3d9f/1c0fcd4a.mp3" length="21082326" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09618v1">http://arxiv.org/abs/2412.09618v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant achievements have been made in the personalization of diffusion models. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot model interactions among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials</title>
      <itunes:episode>208</itunes:episode>
      <podcast:episode>208</podcast:episode>
      <itunes:title>AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">73783762-10d1-4287-b553-4584405e1da0</guid>
      <link>https://share.transistor.fm/s/99c77546</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu</p>

            <p><strong>Title:</strong><br>
            AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09605v1">http://arxiv.org/abs/2412.09605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu</p>

            <p><strong>Title:</strong><br>
            AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09605v1">http://arxiv.org/abs/2412.09605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:05:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/99c77546/3dec4d9a.mp3" length="18152002" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1131</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu</p>

            <p><strong>Title:</strong><br>
            AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09605v1">http://arxiv.org/abs/2412.09605v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training</title>
      <itunes:episode>207</itunes:episode>
      <podcast:episode>207</podcast:episode>
      <itunes:title>SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f406b62f-7fe6-4cb5-8bbb-28352ea3ed86</guid>
      <link>https://share.transistor.fm/s/f3e2e39c</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S. -H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren</p>

            <p><strong>Title:</strong><br>
            SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09619v1">http://arxiv.org/abs/2412.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters while being significantly smaller (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S. -H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren</p>

            <p><strong>Title:</strong><br>
            SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09619v1">http://arxiv.org/abs/2412.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:04:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f3e2e39c/17d72d7a.mp3" length="18434163" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1148</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S. -H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren</p>

            <p><strong>Title:</strong><br>
            SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09619v1">http://arxiv.org/abs/2412.09619v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion</title>
      <itunes:episode>206</itunes:episode>
      <podcast:episode>206</podcast:episode>
      <itunes:title>Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">053a5dd5-80a5-4f6a-ae61-190eb79f6ae1</guid>
      <link>https://share.transistor.fm/s/40c94f9a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09593v1">http://arxiv.org/abs/2412.09593v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig.</p>
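
            <p><strong>Illustrative sketch:</strong><br>
            One simple way to picture the second stage is channel-wise packing: the input image and the N consistently relit images are concatenated and fed to a U-Net-style predictor that emits per-pixel normals and PBR materials. The channel layout and output split below are assumptions for illustration, not the paper's exact design.</p>

            <pre><code>
# Illustrative input packing for a G-buffer predictor: concatenate the input
# image with N relit images along channels, predict per-pixel normals (3) +
# albedo (3) + roughness (1) + metallic (1). The layout is an assumption,
# not the Neural LightRig design.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_lights, h, w = 6, 128, 128
input_img = torch.rand(1, 3, h, w)
relit = torch.rand(1, num_lights * 3, h, w)        # N relit views from the multi-light diffusion

gbuffer_head = nn.Conv2d((num_lights + 1) * 3, 8, kernel_size=3, padding=1)  # stands in for a U-Net
pred = gbuffer_head(torch.cat([input_img, relit], dim=1))

normals = F.normalize(pred[:, :3], dim=1)           # unit-length surface normals
albedo, roughness, metallic = pred[:, 3:6], pred[:, 6:7], pred[:, 7:8]
print(normals.shape, albedo.shape, roughness.shape, metallic.shape)
</code></pre>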
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09593v1">http://arxiv.org/abs/2412.09593v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:04:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/40c94f9a/caa29a41.mp3" length="21768628" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1357</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zexin He, Tengfei Wang, Xin Huang, Xingang Pan, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09593v1">http://arxiv.org/abs/2412.09593v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recovering the geometry and materials of objects from a single image is challenging due to its under-constrained nature. In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. Specifically, 1) we first leverage illumination priors from large-scale diffusion models to build our multi-light diffusion model on a synthetic relighting dataset with dedicated designs. This diffusion model generates multiple consistent images, each illuminated by point light sources in different directions. 2) By using these varied lighting images to reduce estimation uncertainty, we train a large G-buffer model with a U-Net backbone to accurately predict surface normals and materials. Extensive experiments validate that our approach significantly outperforms state-of-the-art methods, enabling accurate surface normal and PBR material estimation with vivid relighting effects. Code and dataset are available on our project page at https://projects.zxhezexin.com/neural-lightrig.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>JuStRank: Benchmarking LLM Judges for System Ranking</title>
      <itunes:episode>205</itunes:episode>
      <podcast:episode>205</podcast:episode>
      <itunes:title>JuStRank: Benchmarking LLM Judges for System Ranking</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">33fff7ad-fdf7-4490-bea1-23230937792d</guid>
      <link>https://share.transistor.fm/s/a6a0e7cc</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai</p>

            <p><strong>Title:</strong><br>
            JuStRank: Benchmarking LLM Judges for System Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09569v1">http://arxiv.org/abs/2412.09569v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.</p>
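
            <p><strong>Illustrative sketch:</strong><br>
            The system-level protocol described above boils down to two steps: aggregate per-output judgment scores into one score per system, then compare the induced ranking with a human-based ranking. The mean aggregation and Kendall's tau used below are illustrative choices for the sketch, not necessarily the paper's exact settings.</p>

            <pre><code>
# Sketch of system-level judge evaluation: mean-aggregate judge scores per
# system, then measure rank agreement with a human ranking.
# Mean aggregation and Kendall's tau are illustrative choices only.
from statistics import mean
from scipy.stats import kendalltau

judge_scores = {                      # per-system judgment scores over many outputs
    "system_a": [0.9, 0.8, 0.85],
    "system_b": [0.6, 0.7, 0.65],
    "system_c": [0.75, 0.7, 0.8],
}
human_scores = {"system_a": 0.92, "system_b": 0.55, "system_c": 0.78}

systems = sorted(judge_scores)
judge_agg = [mean(judge_scores[s]) for s in systems]
human_agg = [human_scores[s] for s in systems]

tau, p_value = kendalltau(judge_agg, human_agg)   # agreement of the two orderings
print(f"judge-vs-human Kendall tau = {tau:.2f}")
</code></pre>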
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai</p>

            <p><strong>Title:</strong><br>
            JuStRank: Benchmarking LLM Judges for System Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09569v1">http://arxiv.org/abs/2412.09569v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 13 Dec 2024 20:03:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6a0e7cc/092654ea.mp3" length="20377612" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1270</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai</p>

            <p><strong>Title:</strong><br>
            JuStRank: Benchmarking LLM Judges for System Ranking</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.09569v1">http://arxiv.org/abs/2412.09569v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints</title>
      <itunes:episode>204</itunes:episode>
      <podcast:episode>204</podcast:episode>
      <itunes:title>SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b9621f7-61f8-407a-b3ee-0d0974c343e4</guid>
      <link>https://share.transistor.fm/s/7ece45fb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang</p>

            <p><strong>Title:</strong><br>
            SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07760v1">http://arxiv.org/abs/2412.07760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.</p>
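
            <p><strong>Illustrative sketch:</strong><br>
            A multi-view synchronization module of the kind described above can be read as attention applied across the camera-view axis, so that tokens at the same spatio-temporal location exchange information between views. The layer below is a generic sketch of that reading, with assumed shapes and a stock attention layer, not the authors' module.</p>

            <pre><code>
# Generic cross-view self-attention: tokens at the same spatio-temporal
# location attend to each other across camera views. Illustrative only;
# shapes and the use of nn.MultiheadAttention are assumptions.
import torch
import torch.nn as nn


class CrossViewSync(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, views, frames, tokens, dim)
        b, v, f, n, d = x.shape
        seq = x.permute(0, 2, 3, 1, 4).reshape(b * f * n, v, d)   # views as the sequence axis
        out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + out)                                # residual keeps the base content
        return seq.reshape(b, f, n, v, d).permute(0, 3, 1, 2, 4)


sync = CrossViewSync(dim=64)
video_tokens = torch.randn(1, 4, 8, 16, 64)    # 4 camera views, 8 frames, 16 tokens each
synced = sync(video_tokens)                    # same shape, now view-consistent features
</code></pre>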
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang</p>

            <p><strong>Title:</strong><br>
            SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07760v1">http://arxiv.org/abs/2412.07760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:18:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ece45fb/e5c706b2.mp3" length="20360087" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1269</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Xiao Fu, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang</p>

            <p><strong>Title:</strong><br>
            SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07760v1">http://arxiv.org/abs/2412.07760v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: https://jianhongbai.github.io/SynCamMaster/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations</title>
      <itunes:episode>203</itunes:episode>
      <podcast:episode>203</podcast:episode>
      <itunes:title>LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d2eb73d9-0927-4c33-a7ca-9d8058d29a7a</guid>
      <link>https://share.transistor.fm/s/6a684a24</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun</p>

            <p><strong>Title:</strong><br>
            LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08580v1">http://arxiv.org/abs/2412.08580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing image-text pair datasets, which provide only prompts and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on our LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.</p>
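
            <p><strong>Illustrative sketch:</strong><br>
            A scene-graph annotation pairs a caption with objects, their attributes, and the relations between them. The toy record below only illustrates that structure; the field names are hypothetical rather than the released LAION-SG schema.</p>

            <pre><code>
# Illustrative scene-graph annotation for one image-text pair.
# Field names are hypothetical; see the LAION-SG release for the real schema.
import json

annotation = {
    "caption": "a brown dog sleeping on a red sofa next to a wooden table",
    "objects": [
        {"id": 0, "name": "dog",   "attributes": ["brown", "sleeping"]},
        {"id": 1, "name": "sofa",  "attributes": ["red"]},
        {"id": 2, "name": "table", "attributes": ["wooden"]},
    ],
    "relations": [
        {"subject": 0, "predicate": "on",      "object": 1},
        {"subject": 1, "predicate": "next to", "object": 2},
    ],
}

print(json.dumps(annotation, indent=2))
</code></pre>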
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun</p>

            <p><strong>Title:</strong><br>
            LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08580v1">http://arxiv.org/abs/2412.08580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing image-text pair datasets, which provide only prompts and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on our LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:17:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a684a24/e3d59860.mp3" length="20674837" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1288</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun</p>

            <p><strong>Title:</strong><br>
            LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08580v1">http://arxiv.org/abs/2412.08580v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show decayed performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing image-text pair datasets, which provide only prompts and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model SDXL-SG to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on our LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>POINTS1.5: Building a Vision-Language Model towards Real World Applications</title>
      <itunes:episode>202</itunes:episode>
      <podcast:episode>202</podcast:episode>
      <itunes:title>POINTS1.5: Building a Vision-Language Model towards Real World Applications</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">704add28-3fa6-4b7f-a7b3-be8c5a8f2e15</guid>
      <link>https://share.transistor.fm/s/c97877cb</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS1.5: Building a Vision-Language Model towards Real World Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08443v1">http://arxiv.org/abs/2412.08443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.</p>
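
            <p><strong>Illustrative sketch:</strong><br>
            A NaViT-style encoder avoids tiling by turning an image at its native resolution into a variable-length sequence of patches, so the sequence length, rather than a fixed input size, absorbs the resolution. The minimal patchification below illustrates that idea; the patch size is an arbitrary choice for the sketch, not the POINTS1.5 setting.</p>

            <pre><code>
# Minimal native-resolution patchification: any H x W image becomes a
# variable-length sequence of flattened patches, so no tiling is needed.
# Patch size 14 is illustrative only.
import numpy as np

def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch          # crop to a multiple of the patch size
    grid = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

for shape in [(224, 336, 3), (448, 448, 3), (98, 1024, 3)]:
    tokens = patchify(np.zeros(shape, dtype=np.float32))
    print(shape, "->", tokens.shape)             # sequence length varies with resolution
</code></pre>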
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS1.5: Building a Vision-Language Model towards Real World Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08443v1">http://arxiv.org/abs/2412.08443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:17:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c97877cb/2904c9a9.mp3" length="23459666" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1463</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou</p>

            <p><strong>Title:</strong><br>
            POINTS1.5: Building a Vision-Language Model towards Real World Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08443v1">http://arxiv.org/abs/2412.08443v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g., optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning Flow Fields in Attention for Controllable Person Image Generation</title>
      <itunes:episode>201</itunes:episode>
      <podcast:episode>201</podcast:episode>
      <itunes:title>Learning Flow Fields in Attention for Controllable Person Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">afa00af7-359b-494d-b22e-7682a84d5fc0</guid>
      <link>https://share.transistor.fm/s/87815a97</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He</p>

            <p><strong>Title:</strong><br>
            Learning Flow Fields in Attention for Controllable Person Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08486v2">http://arxiv.org/abs/2412.08486v2</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.</p>
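
            <p><strong>Illustrative sketch:</strong><br>
            The regularization idea can be sketched as a cross-entropy term that pushes each target query's attention distribution toward the reference key it should correspond to. This simplified stand-in assumes a known per-query correspondence index and is not the exact Leffa loss.</p>

            <pre><code>
# Simplified attention-regularization sketch: for each target query, push the
# attention distribution over reference keys toward the key it should match.
# Cross-entropy against a correspondence index is an illustrative choice.
import torch
import torch.nn.functional as F

def attention_flow_loss(attn_logits: torch.Tensor, target_index: torch.Tensor) -> torch.Tensor:
    """
    attn_logits:  (batch, num_queries, num_ref_keys) pre-softmax attention scores
    target_index: (batch, num_queries) reference key each query should attend to
    """
    b, q, k = attn_logits.shape
    return F.cross_entropy(attn_logits.reshape(b * q, k), target_index.reshape(b * q))

logits = torch.randn(2, 16, 64, requires_grad=True)
target = torch.randint(0, 64, (2, 16))
loss = attention_flow_loss(logits, target)   # added on top of the diffusion training loss
loss.backward()
</code></pre>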
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He</p>

            <p><strong>Title:</strong><br>
            Learning Flow Fields in Attention for Controllable Person Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08486v2">http://arxiv.org/abs/2412.08486v2</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:17:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/87815a97/889030d3.mp3" length="20288609" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He</p>

            <p><strong>Title:</strong><br>
            Learning Flow Fields in Attention for Controllable Person Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08486v2">http://arxiv.org/abs/2412.08486v2</a></p>

            <p><strong>Abstract:</strong><br>
            Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StyleMaster: Stylize Your Video with Artistic Generation and Translation</title>
      <itunes:episode>200</itunes:episode>
      <podcast:episode>200</podcast:episode>
      <itunes:title>StyleMaster: Stylize Your Video with Artistic Generation and Translation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eb6fe609-34cc-48b1-bd46-84f3b2830a87</guid>
      <link>https://share.transistor.fm/s/3371557a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo</p>

            <p><strong>Title:</strong><br>
            StyleMaster: Stylize Your Video with Artistic Generation and Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07744v1">http://arxiv.org/abs/2412.07744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at https://zixuan-ye.github.io/stylemaster</p>
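
            <p><strong>Illustrative sketch:</strong><br>
            The patch-filtering step can be illustrated by embedding the style image's patches and the content prompt in a shared space, then dropping the patches most similar to the prompt (likely content) and keeping the rest as style tokens. The random embeddings and the 80% quantile threshold below are placeholders, not the paper's settings.</p>

            <pre><code>
# Illustrative prompt-patch filtering: drop style-image patches whose
# embeddings are too similar to the content prompt, keep the rest as style.
# Embeddings are random placeholders; a CLIP-like encoder would supply them.
import numpy as np

rng = np.random.default_rng(0)
patch_emb = rng.normal(size=(196, 512))           # one embedding per image patch
prompt_emb = rng.normal(size=(512,))              # embedding of the content prompt

patch_emb /= np.linalg.norm(patch_emb, axis=1, keepdims=True)
prompt_emb /= np.linalg.norm(prompt_emb)

similarity = patch_emb @ prompt_emb               # cosine similarity per patch
keep = similarity < np.quantile(similarity, 0.8)  # drop the most content-like patches
style_tokens = patch_emb[keep]
print(style_tokens.shape)                         # roughly 80% of patches kept as style features
</code></pre>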
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo</p>

            <p><strong>Title:</strong><br>
            StyleMaster: Stylize Your Video with Artistic Generation and Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07744v1">http://arxiv.org/abs/2412.07744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at https://zixuan-ye.github.io/stylemaster</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:16:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3371557a/fc0fa3d0.mp3" length="22465338" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1400</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo</p>

            <p><strong>Title:</strong><br>
            StyleMaster: Stylize Your Video with Artistic Generation and Translation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07744v1">http://arxiv.org/abs/2412.07744v1</a></p>

            <p><strong>Abstract:</strong><br>
            Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at https://zixuan-ye.github.io/stylemaster</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StreamChat: Chatting with Streaming Video</title>
      <itunes:episode>199</itunes:episode>
      <podcast:episode>199</podcast:episode>
      <itunes:title>StreamChat: Chatting with Streaming Video</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3f52c74f-2ac4-400a-af71-0b562aee23c3</guid>
      <link>https://share.transistor.fm/s/4e233826</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvare</p>

            <p><strong>Title:</strong><br>
            StreamChat: Chatting with Streaming Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08646v1">http://arxiv.org/abs/2412.08646v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.</p>
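
            <p><strong>Illustrative sketch:</strong><br>
            The core behaviour, refreshing the visual context at every decoding step rather than freezing it when the question is asked, can be sketched as a generation loop that pulls the newest frame before emitting each token. The encoder and decoder below are stubs standing in for the actual LMM, not the StreamChat architecture.</p>

            <pre><code>
# Sketch of decode-time visual-context refresh: before each generated token,
# pull the latest frame so the answer tracks the live stream. All components
# here are stubs standing in for the real model.
from collections import deque

def encode_frame(frame):                         # stub visual encoder
    return sum(frame)

def decode_step(question, context, tokens):     # stub language decoder step
    return f"tok{len(tokens)}(ctx={len(context)})"

def answer_over_stream(frames, question, max_tokens=5):
    stream = deque(frames)
    context, tokens = [], []
    for _ in range(max_tokens):
        if stream:                               # refresh visual context at every step
            context.append(encode_frame(stream.popleft()))
        tokens.append(decode_step(question, context, tokens))
    return tokens

print(answer_over_stream([[1, 2], [3, 4], [5, 6]], "what changed?"))
</code></pre>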
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvare</p>

            <p><strong>Title:</strong><br>
            StreamChat: Chatting with Streaming Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08646v1">http://arxiv.org/abs/2412.08646v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:16:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4e233826/70abf905.mp3" length="19007531" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1184</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvare</p>

            <p><strong>Title:</strong><br>
            StreamChat: Chatting with Streaming Video</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.08646v1">http://arxiv.org/abs/2412.08646v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark</title>
      <itunes:episode>198</itunes:episode>
      <podcast:episode>198</podcast:episode>
      <itunes:title>3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aee9d38b-9ff6-4006-b3cf-6e43d21090c2</guid>
      <link>https://share.transistor.fm/s/5191f0f6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07825v1">http://arxiv.org/abs/2412.07825v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provides valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07825v1">http://arxiv.org/abs/2412.07825v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provides valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:16:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5191f0f6/13146a2b.mp3" length="24116262" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Alan Yuille, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07825v1">http://arxiv.org/abs/2412.07825v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provides valuable findings and insights about the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction</title>
      <itunes:episode>197</itunes:episode>
      <podcast:episode>197</podcast:episode>
      <itunes:title>Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7334d0d0-4afb-494e-aa9a-d75e2b47495d</guid>
      <link>https://share.transistor.fm/s/8219c558</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06234v2">http://arxiv.org/abs/2412.06234v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06234v2">http://arxiv.org/abs/2412.06234v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:15:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8219c558/390e74e7.mp3" length="21865182" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1363</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Seungtae Nam, Xiangyu Sun, Gyeongjin Kang, Younggeun Lee, Seungjun Oh, Eunbyung Park</p>

            <p><strong>Title:</strong><br>
            Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06234v2">http://arxiv.org/abs/2412.06234v2</a></p>

            <p><strong>Abstract:</strong><br>
            Generalized feed-forward Gaussian models have achieved significant progress in sparse-view 3D reconstruction by leveraging prior knowledge from large multi-view datasets. However, these models often struggle to represent high-frequency details due to the limited number of Gaussians. While the densification strategy used in per-scene 3D Gaussian splatting (3D-GS) optimization can be adapted to the feed-forward models, it may not be ideally suited for generalized scenarios. In this paper, we propose Generative Densification, an efficient and generalizable method to densify Gaussians generated by feed-forward models. Unlike the 3D-GS densification strategy, which iteratively splits and clones raw Gaussian parameters, our method up-samples feature representations from the feed-forward models and generates their corresponding fine Gaussians in a single forward pass, leveraging the embedded prior knowledge for enhanced generalization. Experimental results on both object-level and scene-level reconstruction tasks demonstrate that our method outperforms state-of-the-art approaches with comparable or smaller model sizes, achieving notable improvements in representing fine details.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The BrowserGym Ecosystem for Web Agent Research</title>
      <itunes:episode>196</itunes:episode>
      <podcast:episode>196</podcast:episode>
      <itunes:title>The BrowserGym Ecosystem for Web Agent Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c196ebdf-d0dc-47f7-9bcd-07c6f55427dd</guid>
      <link>https://share.transistor.fm/s/5a26350f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste</p>

            <p><strong>Title:</strong><br>
            The BrowserGym Ecosystem for Web Agent Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05467v3">http://arxiv.org/abs/2412.05467v3</a></p>

            <p><strong>Abstract:</strong><br>
            The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste</p>

            <p><strong>Title:</strong><br>
            The BrowserGym Ecosystem for Web Agent Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05467v3">http://arxiv.org/abs/2412.05467v3</a></p>

            <p><strong>Abstract:</strong><br>
            The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 12 Dec 2024 20:15:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5a26350f/ae54ecbc.mp3" length="24311857" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1516</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.LG, cs.AI, cs.SE</p>

            <p><strong>Authors:</strong><br>
            Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste</p>

            <p><strong>Title:</strong><br>
            The BrowserGym Ecosystem for Web Agent Research</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05467v3">http://arxiv.org/abs/2412.05467v3</a></p>

            <p><strong>Abstract:</strong><br>
            The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation</title>
      <itunes:episode>195</itunes:episode>
      <podcast:episode>195</podcast:episode>
      <itunes:title>DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ae1afdae-7fbf-4233-9576-38feabfaa464</guid>
      <link>https://share.transistor.fm/s/bf5daa2b</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07589v1">http://arxiv.org/abs/2412.07589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce MangaZero, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is https://jianzongwu.github.io/projects/diffsensei/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07589v1">http://arxiv.org/abs/2412.07589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce MangaZero, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is https://jianzongwu.github.io/projects/diffsensei/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:35:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bf5daa2b/8040e4fa.mp3" length="21322238" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong</p>

            <p><strong>Title:</strong><br>
            DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07589v1">http://arxiv.org/abs/2412.07589v1</a></p>

            <p><strong>Abstract:</strong><br>
            Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce MangaZero, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is https://jianzongwu.github.io/projects/diffsensei/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hidden in the Noise: Two-Stage Robust Watermarking for Images</title>
      <itunes:episode>194</itunes:episode>
      <podcast:episode>194</podcast:episode>
      <itunes:title>Hidden in the Noise: Two-Stage Robust Watermarking for Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5d69fd28-3dd4-4b39-8d02-6646cabafcb6</guid>
      <link>https://share.transistor.fm/s/d56c7b83</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen</p>

            <p><strong>Title:</strong><br>
            Hidden in the Noise: Two-Stage Robust Watermarking for Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04653v2">http://arxiv.org/abs/2412.04653v2</a></p>

            <p><strong>Abstract:</strong><br>
            As the quality of image generators continues to improve, deepfakes become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of-the-art methods in image watermarking remain vulnerable to forgery and removal attacks. This vulnerability occurs in part because watermarks distort the distribution of generated images, unintentionally revealing information about the watermarking techniques.   In this work, we first demonstrate a distortion-free watermarking method for images, based on a diffusion model's initial noise. However, detecting the watermark requires comparing the initial noise reconstructed for an image to all previously used initial noises. To mitigate these issues, we propose a two-stage watermarking framework for efficient detection. During generation, we augment the initial noise with generated Fourier patterns to embed information about the group of initial noises we used. For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image. This watermarking approach achieves state-of-the-art robustness to forgery and removal against a large battery of attacks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen</p>

            <p><strong>Title:</strong><br>
            Hidden in the Noise: Two-Stage Robust Watermarking for Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04653v2">http://arxiv.org/abs/2412.04653v2</a></p>

            <p><strong>Abstract:</strong><br>
            As the quality of image generators continues to improve, deepfakes become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of-the-art methods in image watermarking remain vulnerable to forgery and removal attacks. This vulnerability occurs in part because watermarks distort the distribution of generated images, unintentionally revealing information about the watermarking techniques.   In this work, we first demonstrate a distortion-free watermarking method for images, based on a diffusion model's initial noise. However, detecting the watermark requires comparing the initial noise reconstructed for an image to all previously used initial noises. To mitigate these issues, we propose a two-stage watermarking framework for efficient detection. During generation, we augment the initial noise with generated Fourier patterns to embed information about the group of initial noises we used. For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image. This watermarking approach achieves state-of-the-art robustness to forgery and removal against a large battery of attacks.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:34:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d56c7b83/9fc3ffec.mp3" length="20682313" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1289</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 20 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kasra Arabi, Benjamin Feuer, R. Teal Witter, Chinmay Hegde, Niv Cohen</p>

            <p><strong>Title:</strong><br>
            Hidden in the Noise: Two-Stage Robust Watermarking for Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04653v2">http://arxiv.org/abs/2412.04653v2</a></p>

            <p><strong>Abstract:</strong><br>
            As the quality of image generators continues to improve, deepfakes become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of-the-art methods in image watermarking remain vulnerable to forgery and removal attacks. This vulnerability occurs in part because watermarks distort the distribution of generated images, unintentionally revealing information about the watermarking techniques.   In this work, we first demonstrate a distortion-free watermarking method for images, based on a diffusion model's initial noise. However, detecting the watermark requires comparing the initial noise reconstructed for an image to all previously used initial noises. To mitigate these issues, we propose a two-stage watermarking framework for efficient detection. During generation, we augment the initial noise with generated Fourier patterns to embed information about the group of initial noises we used. For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image. This watermarking approach achieves state-of-the-art robustness to forgery and removal against a large battery of attacks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models</title>
      <itunes:episode>193</itunes:episode>
      <podcast:episode>193</podcast:episode>
      <itunes:title>FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0e8d284a-9605-4cbc-9b32-82b9d302b0aa</guid>
      <link>https://share.transistor.fm/s/8abddf74</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07674v1">http://arxiv.org/abs/2412.07674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07674v1">http://arxiv.org/abs/2412.07674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:34:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8abddf74/efc48cf1.mp3" length="19070262" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1188</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tong Wu, Yinghao Xu, Ryan Po, Mengchen Zhang, Guandao Yang, Jiaqi Wang, Ziwei Liu, Dahua Lin, Gordon Wetzstein</p>

            <p><strong>Title:</strong><br>
            FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07674v1">http://arxiv.org/abs/2412.07674v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, "style" is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified "style" adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed, to the best of our knowledge, the first fine-grained visual attributes dataset (FiVA). This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics</title>
      <itunes:episode>192</itunes:episode>
      <podcast:episode>192</podcast:episode>
      <itunes:title>UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3e92a9c2-442a-418f-937b-8707f458f07f</guid>
      <link>https://share.transistor.fm/s/a6e29702</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07774v1">http://arxiv.org/abs/2412.07774v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by task, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07774v1">http://arxiv.org/abs/2412.07774v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by task, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:34:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a6e29702/22816f11.mp3" length="23039204" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1436</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao</p>

            <p><strong>Title:</strong><br>
            UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07774v1">http://arxiv.org/abs/2412.07774v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by task, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation</title>
      <itunes:episode>191</itunes:episode>
      <podcast:episode>191</podcast:episode>
      <itunes:title>3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c23e658b-a193-44a4-bf98-bfaa1025c26e</guid>
      <link>https://share.transistor.fm/s/832aa358</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07759v1">http://arxiv.org/abs/2412.07759v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly surrounding cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07759v1">http://arxiv.org/abs/2412.07759v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly surrounding cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:33:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/832aa358/1f79642b.mp3" length="22882470" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1426</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xiao Fu, Xian Liu, Xintao Wang, Sida Peng, Menghan Xia, Xiaoyu Shi, Ziyang Yuan, Pengfei Wan, Di Zhang, Dahua Lin</p>

            <p><strong>Title:</strong><br>
            3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07759v1">http://arxiv.org/abs/2412.07759v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectories and then captures their motion with 12 evenly surrounding cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: http://fuxiao0719.github.io/projects/3dtrajmaster</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mobile Video Diffusion</title>
      <itunes:episode>190</itunes:episode>
      <podcast:episode>190</podcast:episode>
      <itunes:title>Mobile Video Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ad86ceab-3c58-4acc-a997-c3c5b4b5a7fd</guid>
      <link>https://share.transistor.fm/s/79f65b45</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian</p>

            <p><strong>Title:</strong><br>
            Mobile Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07583v1">http://arxiv.org/abs/2412.07583v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian</p>

            <p><strong>Title:</strong><br>
            Mobile Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07583v1">http://arxiv.org/abs/2412.07583v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:33:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/79f65b45/ea6e66ca.mp3" length="23717494" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1479</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian</p>

            <p><strong>Title:</strong><br>
            Mobile Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07583v1">http://arxiv.org/abs/2412.07583v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Granite Guardian</title>
      <itunes:episode>189</itunes:episode>
      <podcast:episode>189</podcast:episode>
      <itunes:title>Granite Guardian</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d0dd9ccf-c252-4eea-aaec-c8d2fb19cf72</guid>
      <link>https://share.transistor.fm/s/01cadc81</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri</p>

            <p><strong>Title:</strong><br>
            Granite Guardian</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07724v1">http://arxiv.org/abs/2412.07724v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community.   https://github.com/ibm-granite/granite-guardian</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri</p>

            <p><strong>Title:</strong><br>
            Granite Guardian</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07724v1">http://arxiv.org/abs/2412.07724v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community.   https://github.com/ibm-granite/granite-guardian</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 11 Dec 2024 23:32:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/01cadc81/cea0da9a.mp3" length="20221259" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 16 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri</p>

            <p><strong>Title:</strong><br>
            Granite Guardian</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07724v1">http://arxiv.org/abs/2412.07724v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community.   https://github.com/ibm-granite/granite-guardian</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation</title>
      <itunes:episode>188</itunes:episode>
      <podcast:episode>188</podcast:episode>
      <itunes:title>Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f5690b78-f891-41dc-ae66-1604c1cbcb14</guid>
      <link>https://share.transistor.fm/s/18052e0a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06531v1">http://arxiv.org/abs/2412.06531v1</a></p>

            <p><strong>Abstract:</strong><br>
            The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term versus short-term memory and declarative versus procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory by conducting experiments with different RL agents and showing what violating it leads to.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06531v1">http://arxiv.org/abs/2412.06531v1</a></p>

            <p><strong>Abstract:</strong><br>
            The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term versus short-term memory and declarative versus procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory by conducting experiments with different RL agents and showing what violating it leads to.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:25:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/18052e0a/882f03b4.mp3" length="18199670" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1134</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 54 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov</p>

            <p><strong>Title:</strong><br>
            Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06531v1">http://arxiv.org/abs/2412.06531v1</a></p>

            <p><strong>Abstract:</strong><br>
            The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the utilization of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term versus short-term memory and declarative versus procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory by conducting experiments with different RL agents and showing what violating it leads to.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ProcessBench: Identifying Process Errors in Mathematical Reasoning</title>
      <itunes:episode>187</itunes:episode>
      <podcast:episode>187</podcast:episode>
      <itunes:title>ProcessBench: Identifying Process Errors in Mathematical Reasoning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f4672dd0-9810-4297-bb21-253667ef8c77</guid>
      <link>https://share.transistor.fm/s/9ce79155</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            ProcessBench: Identifying Process Errors in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06559v2">http://arxiv.org/abs/2412.06559v2</a></p>

            <p><strong>Abstract:</strong><br>
            As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            ProcessBench: Identifying Process Errors in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06559v2">http://arxiv.org/abs/2412.06559v2</a></p>

            <p><strong>Abstract:</strong><br>
            As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:24:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ce79155/46a5ee7b.mp3" length="20578246" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1282</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 38 | cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            ProcessBench: Identifying Process Errors in Mathematical Reasoning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06559v2">http://arxiv.org/abs/2412.06559v2</a></p>

            <p><strong>Abstract:</strong><br>
            As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct an extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, has demonstrated critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Training Large Language Models to Reason in a Continuous Latent Space</title>
      <itunes:episode>186</itunes:episode>
      <podcast:episode>186</podcast:episode>
      <itunes:title>Training Large Language Models to Reason in a Continuous Latent Space</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f57ca28e-1cec-4d58-b512-a4f6739fbdf6</guid>
      <link>https://share.transistor.fm/s/daa4e91d</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian</p>

            <p><strong>Title:</strong><br>
            Training Large Language Models to Reason in a Continuous Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06769v1">http://arxiv.org/abs/2412.06769v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian</p>

            <p><strong>Title:</strong><br>
            Training Large Language Models to Reason in a Continuous Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06769v1">http://arxiv.org/abs/2412.06769v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:24:32 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/daa4e91d/fa829317.mp3" length="21206442" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1322</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian</p>

            <p><strong>Title:</strong><br>
            Training Large Language Models to Reason in a Continuous Latent Space</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06769v1">http://arxiv.org/abs/2412.06769v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation</title>
      <itunes:episode>185</itunes:episode>
      <podcast:episode>185</podcast:episode>
      <itunes:title>Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9f2fbcc-7d76-4494-a1a2-6d87f1198fa9</guid>
      <link>https://share.transistor.fm/s/3edb83ff</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan</p>

            <p><strong>Title:</strong><br>
            Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04432v1">http://arxiv.org/abs/2412.04432v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan</p>

            <p><strong>Title:</strong><br>
            Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04432v1">http://arxiv.org/abs/2412.04432v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:24:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3edb83ff/b6044e18.mp3" length="22962710" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1431</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan</p>

            <p><strong>Title:</strong><br>
            Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04432v1">http://arxiv.org/abs/2412.04432v1</a></p>

            <p><strong>Abstract:</strong><br>
            In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation</title>
      <itunes:episode>184</itunes:episode>
      <podcast:episode>184</podcast:episode>
      <itunes:title>Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cd61c7a9-3e92-46e8-a338-9d1d8b144a64</guid>
      <link>https://share.transistor.fm/s/f3ce9d4a</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu</p>

            <p><strong>Title:</strong><br>
            Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06781v1">http://arxiv.org/abs/2412.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu</p>

            <p><strong>Title:</strong><br>
            Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06781v1">http://arxiv.org/abs/2412.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:23:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f3ce9d4a/663c6b3b.mp3" length="21483564" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1339</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 9 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu</p>

            <p><strong>Title:</strong><br>
            Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06781v1">http://arxiv.org/abs/2412.06781v1</a></p>

            <p><strong>Abstract:</strong><br>
            Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models</title>
      <itunes:episode>183</itunes:episode>
      <podcast:episode>183</podcast:episode>
      <itunes:title>Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ee8685a4-e423-41d0-a7a8-8fcd67af047d</guid>
      <link>https://share.transistor.fm/s/3a6b27b9</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan</p>

            <p><strong>Title:</strong><br>
            Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05939v1">http://arxiv.org/abs/2412.05939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench. Code, data and models will be available at https://github.com/LooperXX/MMGiC.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan</p>

            <p><strong>Title:</strong><br>
            Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05939v1">http://arxiv.org/abs/2412.05939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench. Code, data and models will be available at https://github.com/LooperXX/MMGiC.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:23:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3a6b27b9/b301f906.mp3" length="21670388" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1351</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 8 | cs.CV, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiao Xu, Tianhao Niu, Yuxi Xie, Libo Qin, Wanxiang Che, Min-Yen Kan</p>

            <p><strong>Title:</strong><br>
            Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05939v1">http://arxiv.org/abs/2412.05939v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as both data granularities complement each other in terms of breadth and depth in concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other, under our structured template and a general MLLM framework. We clearly explore and demonstrate the potential of MMGiC to help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the fair comparison and effective collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks, e.g., their appropriate combination achieves 3.95% and 2.34% absolute improvements over image-caption data alone on POPE and SEED-Bench. Code, data and models will be available at https://github.com/LooperXX/MMGiC.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale</title>
      <itunes:episode>182</itunes:episode>
      <podcast:episode>182</podcast:episode>
      <itunes:title>You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9faa7814-3dfa-4e1f-a4db-9b628312ba27</guid>
      <link>https://share.transistor.fm/s/0b498e32</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang</p>

            <p><strong>Title:</strong><br>
            You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06699v1">http://arxiv.org/abs/2412.06699v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent 3D generation models typically rely on limited-scale 3D "gold labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang</p>

            <p><strong>Title:</strong><br>
            You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06699v1">http://arxiv.org/abs/2412.06699v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent 3D generation models typically rely on limited-scale 3D "gold labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:23:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0b498e32/0d99ec44.mp3" length="19155938" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1194</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang</p>

            <p><strong>Title:</strong><br>
            You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06699v1">http://arxiv.org/abs/2412.06699v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent 3D generation models typically rely on limited-scale 3D "gold labels" or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual content from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual condition: a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations</title>
      <itunes:episode>181</itunes:episode>
      <podcast:episode>181</podcast:episode>
      <itunes:title>OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f3b138eb-f568-4864-ba9f-c76d146cc59a</guid>
      <link>https://share.transistor.fm/s/f8955517</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He</p>

            <p><strong>Title:</strong><br>
            OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07626v1">http://arxiv.org/abs/2412.07626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He</p>

            <p><strong>Title:</strong><br>
            OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07626v1">http://arxiv.org/abs/2412.07626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:22:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f8955517/57959250.mp3" length="19648725" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1224</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 7 | cs.CV, cs.AI, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He</p>

            <p><strong>Title:</strong><br>
            OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.07626v1">http://arxiv.org/abs/2412.07626v1</a></p>

            <p><strong>Abstract:</strong><br>
            Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Robust Multi-bit Text Watermark with LLM-based Paraphrasers</title>
      <itunes:episode>180</itunes:episode>
      <podcast:episode>180</podcast:episode>
      <itunes:title>Robust Multi-bit Text Watermark with LLM-based Paraphrasers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">018493eb-20a2-4293-95a6-e9a27765126a</guid>
      <link>https://share.transistor.fm/s/b047d4c1</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robust Multi-bit Text Watermark with LLM-based Paraphrasers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03123v1">http://arxiv.org/abs/2412.03123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference reflected in the text semantics can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode the pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation. We open-source the code: https://github.com/xiaojunxu/multi-bit-text-watermark.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robust Multi-bit Text Watermark with LLM-based Paraphrasers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03123v1">http://arxiv.org/abs/2412.03123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference reflected in the text semantics can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode the pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation. We open-source the code: https://github.com/xiaojunxu/multi-bit-text-watermark.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:22:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b047d4c1/0d5deb79.mp3" length="17321502" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1079</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 5 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xiaojun Xu, Jinghan Jia, Yuanshun Yao, Yang Liu, Hang Li</p>

            <p><strong>Title:</strong><br>
            Robust Multi-bit Text Watermark with LLM-based Paraphrasers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03123v1">http://arxiv.org/abs/2412.03123v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference reflected in the text semantics can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode the pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation. We open-source the code: https://github.com/xiaojunxu/multi-bit-text-watermark.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views</title>
      <itunes:episode>179</itunes:episode>
      <podcast:episode>179</podcast:episode>
      <itunes:title>MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">73d049c2-fd06-4650-ab68-3529f38acffb</guid>
      <link>https://share.transistor.fm/s/2c2de053</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Antoine Guédon, Tomoki Ichikawa, Kohei Yamashita, Ko Nishino</p>

            <p><strong>Title:</strong><br>
            MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06767v1">http://arxiv.org/abs/2412.06767v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atlas of Charts which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines them through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, satisfying both the photorealism of neural volumetric rendering and the crisp geometry of a mesh model, i.e., two seemingly contradictory goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha's state-of-the-art quality of surface reconstruction and photorealism on par with top contenders, but with a dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that requires explicit geometry in addition to photorealism. Our project page is the following: https://anttwo.github.io/matcha/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Antoine Guédon, Tomoki Ichikawa, Kohei Yamashita, Ko Nishino</p>

            <p><strong>Title:</strong><br>
            MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06767v1">http://arxiv.org/abs/2412.06767v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atlas of Charts which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines them through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, satisfying both the photorealism of neural volumetric rendering and the crisp geometry of a mesh model, i.e., two seemingly contradictory goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha's state-of-the-art quality of surface reconstruction and photorealism on par with top contenders, but with a dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that requires explicit geometry in addition to photorealism. Our project page is the following: https://anttwo.github.io/matcha/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 10 Dec 2024 20:22:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2c2de053/98de05e8.mp3" length="21308449" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 4 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Antoine Guédon, Tomoki Ichikawa, Kohei Yamashita, Ko Nishino</p>

            <p><strong>Title:</strong><br>
            MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.06767v1">http://arxiv.org/abs/2412.06767v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a novel appearance model that simultaneously realizes explicit high-quality 3D surface mesh recovery and photorealistic novel view synthesis from sparse view samples. Our key idea is to model the underlying scene geometry Mesh as an Atlas of Charts which we render with 2D Gaussian surfels (MAtCha Gaussians). MAtCha distills high-frequency scene surface details from an off-the-shelf monocular depth estimator and refines them through Gaussian surfel rendering. The Gaussian surfels are attached to the charts on the fly, satisfying both the photorealism of neural volumetric rendering and the crisp geometry of a mesh model, i.e., two seemingly contradictory goals in a single model. At the core of MAtCha lies a novel neural deformation model and a structure loss that preserve the fine surface details distilled from learned monocular depths while addressing their fundamental scale ambiguities. Results of extensive experimental validation demonstrate MAtCha's state-of-the-art quality of surface reconstruction and photorealism on par with top contenders, but with a dramatic reduction in the number of input views and computational time. We believe MAtCha will serve as a foundational tool for any visual application in vision, graphics, and robotics that requires explicit geometry in addition to photorealism. Our project page is the following: https://anttwo.github.io/matcha/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</title>
      <itunes:episode>178</itunes:episode>
      <podcast:episode>178</podcast:episode>
      <itunes:title>LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ae1fcc03-ac62-41dc-bafe-7394a59beef4</guid>
      <link>https://share.transistor.fm/s/d03785ae</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li</p>

            <p><strong>Title:</strong><br>
            LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04814v1">http://arxiv.org/abs/2412.04814v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li</p>

            <p><strong>Title:</strong><br>
            LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04814v1">http://arxiv.org/abs/2412.04814v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:27:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d03785ae/75061458.mp3" length="19639927" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1224</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 33 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li</p>

            <p><strong>Title:</strong><br>
            LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04814v1">http://arxiv.org/abs/2412.04814v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>EXAONE 3.5: Series of Large Language Models for Real-world Use Cases</title>
      <itunes:episode>177</itunes:episode>
      <podcast:episode>177</podcast:episode>
      <itunes:title>EXAONE 3.5: Series of Large Language Models for Real-world Use Cases</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ff8ea4b0-47ae-497c-b138-be628885f90f</guid>
      <link>https://share.transistor.fm/s/9ca475f3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 3.5: Series of Large Language Models for Real-world Use Cases</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04862v2">http://arxiv.org/abs/2412.04862v2</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) competitive results compared to state-of-the-art open models of similar sizes across nine general benchmarks. The EXAONE 3.5 language models are open to anyone for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, please reach out to the official contact point of LG AI Research: contact_us@lgresearch.ai.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 3.5: Series of Large Language Models for Real-world Use Cases</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04862v2">http://arxiv.org/abs/2412.04862v2</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) competitive results compared to state-of-the-art open models of similar sizes across nine general benchmarks. The EXAONE 3.5 language models are open to anyone for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, please reach out to the official contact point of LG AI Research: contact_us@lgresearch.ai.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:26:57 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9ca475f3/5f1c6cfa.mp3" length="21172168" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1320</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 31 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            LG AI Research, Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun</p>

            <p><strong>Title:</strong><br>
            EXAONE 3.5: Series of Large Language Models for Real-world Use Cases</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04862v2">http://arxiv.org/abs/2412.04862v2</a></p>

            <p><strong>Abstract:</strong><br>
            This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) competitive results compared to state-of-the-art open models of similar sizes across nine general benchmarks. The EXAONE 3.5 language models are open to anyone for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, please reach out to the official contact point of LG AI Research: contact_us@lgresearch.ai.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</title>
      <itunes:episode>176</itunes:episode>
      <podcast:episode>176</podcast:episode>
      <itunes:title>MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">710ed47a-c1ff-42f1-ae5a-d67bfab18c3f</guid>
      <link>https://share.transistor.fm/s/a670eda0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05237v1">http://arxiv.org/abs/2412.05237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05237v1">http://arxiv.org/abs/2412.05237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:26:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a670eda0/1b0563e2.mp3" length="21139993" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue</p>

            <p><strong>Title:</strong><br>
            MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05237v1">http://arxiv.org/abs/2412.05237v1</a></p>

            <p><strong>Abstract:</strong><br>
            Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>APOLLO: SGD-like Memory, AdamW-level Performance</title>
      <itunes:episode>175</itunes:episode>
      <podcast:episode>175</podcast:episode>
      <itunes:title>APOLLO: SGD-like Memory, AdamW-level Performance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2b1a9db1-0636-4171-bce0-a426fa8dae4a</guid>
      <link>https://share.transistor.fm/s/c269414f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee</p>

            <p><strong>Title:</strong><br>
            APOLLO: SGD-like Memory, AdamW-level Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05270v2">http://arxiv.org/abs/2412.05270v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.   In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.   Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee</p>

            <p><strong>Title:</strong><br>
            APOLLO: SGD-like Memory, AdamW-level Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05270v2">http://arxiv.org/abs/2412.05270v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.   In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.   Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:26:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c269414f/41692b4b.mp3" length="18996253" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1184</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 27 | cs.LG, cs.AI, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee</p>

            <p><strong>Title:</strong><br>
            APOLLO: SGD-like Memory, AdamW-level Performance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05270v2">http://arxiv.org/abs/2412.05270v2</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance.   In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs.   Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion</title>
      <itunes:episode>174</itunes:episode>
      <podcast:episode>174</podcast:episode>
      <itunes:title>SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a9199dc-0ace-46d1-9c21-58253bb466a8</guid>
      <link>https://share.transistor.fm/s/e4ebfb31</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham</p>

            <p><strong>Title:</strong><br>
            SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04301v2">http://arxiv.org/abs/2412.04301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is at least 50 times faster than previous multi-step methods while maintaining competitive performance in editing results. Our project page is at: https://swift-edit.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham</p>

            <p><strong>Title:</strong><br>
            SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04301v2">http://arxiv.org/abs/2412.04301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is at least 50 times faster than previous multi-step methods while maintaining competitive performance in editing results. Our project page is at: https://swift-edit.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:25:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e4ebfb31/80b19530.mp3" length="19396684" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1209</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham</p>

            <p><strong>Title:</strong><br>
            SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04301v2">http://arxiv.org/abs/2412.04301v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in text-guided image editing enable users to perform image edits through simple text inputs, leveraging the extensive priors of multi-step diffusion-based text-to-image models. However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing (in 0.23s). The advancement of SwiftEdit lies in its two novel contributions: a one-step inversion framework that enables one-step image reconstruction via inversion and a mask-guided editing technique with our proposed attention rescaling mechanism to perform localized image editing. Extensive experiments are provided to demonstrate the effectiveness and efficiency of SwiftEdit. In particular, SwiftEdit enables instant text-guided image editing, which is at least 50 times faster than previous multi-step methods while maintaining competitive performance in editing results. Our project page is at: https://swift-edit.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Moto: Latent Motion Token as the Bridging Language for Robot Manipulation</title>
      <itunes:episode>173</itunes:episode>
      <podcast:episode>173</podcast:episode>
      <itunes:title>Moto: Latent Motion Token as the Bridging Language for Robot Manipulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">595d5949-7d19-4ff2-b005-e6ae6758dcac</guid>
      <link>https://share.transistor.fm/s/86b3bed0</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.RO, cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Moto: Latent Motion Token as the Bridging Language for Robot Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04445v1">http://arxiv.org/abs/2412.04445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.RO, cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Moto: Latent Motion Token as the Bridging Language for Robot Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04445v1">http://arxiv.org/abs/2412.04445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.</p>
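
            <p><strong>Illustrative sketch:</strong><br>
            The pre-training objective described above is next-token prediction over discrete latent motion tokens. The toy model below shows only that objective; the codebook size, context length, and architecture are invented for illustration and are not Moto-GPT or its Latent Motion Tokenizer.</p>

<pre><code>import torch
import torch.nn as nn

VOCAB, CTX, DIM = 512, 32, 128   # toy motion-token codebook / context / width

class TinyMotionGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(CTX, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                   # tokens: (B, T)
        T = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(T))
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.blocks(x, mask=causal))   # (B, T, VOCAB)

# autoregressive objective: predict motion token t+1 from tokens up to t
tokens = torch.randint(0, VOCAB, (2, CTX))       # stand-in tokenizer output
logits = TinyMotionGPT()(tokens)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
print(float(loss))
</code></pre>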
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:25:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/86b3bed0/0e4a20c9.mp3" length="19540043" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1218</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 18 | cs.RO, cs.AI, cs.CL, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            Moto: Latent Motion Token as the Bridging Language for Robot Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04445v1">http://arxiv.org/abs/2412.04445v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in Large Language Models pre-trained on extensive corpora have shown significant success in various natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into latent Motion Token sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration</title>
      <itunes:episode>172</itunes:episode>
      <podcast:episode>172</podcast:episode>
      <itunes:title>GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">65b69241-e0b2-4224-b4be-2b351d72eff8</guid>
      <link>https://share.transistor.fm/s/33c92119</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04440v1">http://arxiv.org/abs/2412.04440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04440v1">http://arxiv.org/abs/2412.04440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.</p>
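
            <p><strong>Illustrative sketch:</strong><br>
            A control-flow sketch of the Design / Generation / Redesign loop described above, with the four redesign roles and the self-routing table stubbed out as plain functions. A real system would back each role with an MLLM call and a text-to-video model; everything below is an assumption made only to show the loop structure.</p>

<pre><code>def design(prompt):                              # stage 1: initial plan
    return {"prompt": prompt, "layout": None, "guidance": 7.5}

def generate(plan):                              # stage 2: stand-in for the T2V model
    return f"video[{plan['prompt']} | g={plan['guidance']}]"

def verify(video, required="two cats"):          # redesign role 1: verification
    ok = required in video
    return {"ok": ok, "issue": None if ok else "missing object"}

def suggest(report):                             # redesign role 2: suggestion
    return "mention both cats explicitly"

CORRECTION_AGENTS = {                            # self-routing: one agent per scenario
    "missing object": lambda plan, hint: {**plan, "prompt": plan["prompt"] + ", two cats"},
    "wrong attribute": lambda plan, hint: plan,
}

def structure_output(plan):                      # redesign role 4: normalized plan
    return plan

def genmac(prompt, max_iters=3):
    plan = design(prompt)
    video = generate(plan)
    for _ in range(max_iters):
        report = verify(video)
        if report["ok"]:
            break
        hint = suggest(report)
        fix = CORRECTION_AGENTS[report["issue"]]  # redesign role 3: correction
        plan = structure_output(fix(plan, hint))
        video = generate(plan)
    return video

print(genmac("a cat playing chess"))
</code></pre>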
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:25:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/33c92119/42217562.mp3" length="21991795" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1371</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04440v1">http://arxiv.org/abs/2412.04440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction</title>
      <itunes:episode>171</itunes:episode>
      <podcast:episode>171</podcast:episode>
      <itunes:title>Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9e315915-24e2-4c20-8e11-55da70834dd1</guid>
      <link>https://share.transistor.fm/s/8eac99c4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan Fan, Wanhua Li, Yifei Han, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04887v1">http://arxiv.org/abs/2412.04887v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block's weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving a 12.8% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art. Project page: https://jixuan-fan.github.io/Momentum-GS_Page/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan Fan, Wanhua Li, Yifei Han, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04887v1">http://arxiv.org/abs/2412.04887v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block's weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving a 12.8% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art. Project page: https://jixuan-fan.github.io/Momentum-GS_Page/</p>
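
            <p><strong>Illustrative sketch:</strong><br>
            A minimal sketch of the two mechanisms named above: a momentum (EMA) teacher decoder and a block-weighted self-distillation loss. The layer sizes, the single teacher tracking one student, and the exact weighting rule are assumptions for illustration, not the paper's implementation.</p>

<pre><code>import torch
import torch.nn as nn

def ema_update(teacher, student, m=0.999):
    """teacher <- m * teacher + (1 - m) * student, without gradients."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(m).add_(ps, alpha=1 - m)

student_blocks = [nn.Linear(16, 3) for _ in range(4)]  # one tiny decoder per block
teacher = nn.Linear(16, 3)                             # shared momentum teacher

feats = torch.randn(8, 16)
errors = [nn.functional.mse_loss(blk(feats), teacher(feats).detach())
          for blk in student_blocks]

# block weighting: here, blocks with lower error get more weight
# (a guess at the direction; the paper defines its own rule)
err = torch.stack(errors)
w = torch.softmax(-err.detach(), dim=0)
distill_loss = (w * err).sum()
distill_loss.backward()

ema_update(teacher, student_blocks[0])  # teacher slowly tracks the students
print(float(distill_loss))
</code></pre>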
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:24:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8eac99c4/cb48b4f1.mp3" length="20300748" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1265</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jixuan Fan, Wanhua Li, Yifei Han, Yansong Tang</p>

            <p><strong>Title:</strong><br>
            Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04887v1">http://arxiv.org/abs/2412.04887v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block's weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving a 12.8% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art. Project page: https://jixuan-fan.github.io/Momentum-GS_Page/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CompCap: Improving Multimodal Large Language Models with Composite Captions</title>
      <itunes:episode>170</itunes:episode>
      <podcast:episode>170</podcast:episode>
      <itunes:title>CompCap: Improving Multimodal Large Language Models with Composite Captions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc6aea67-4de1-4f16-bbc9-4351f2de10cb</guid>
      <link>https://share.transistor.fm/s/810325e4</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He</p>

            <p><strong>Title:</strong><br>
            CompCap: Improving Multimodal Large Language Models with Composite Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05243v1">http://arxiv.org/abs/2412.05243v1</a></p>

            <p><strong>Abstract:</strong><br>
            How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He</p>

            <p><strong>Title:</strong><br>
            CompCap: Improving Multimodal Large Language Models with Composite Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05243v1">http://arxiv.org/abs/2412.05243v1</a></p>

            <p><strong>Abstract:</strong><br>
            How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.</p>
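
            <p><strong>Illustrative sketch:</strong><br>
            A toy version of the assembly step such a pipeline needs: paste element images into one composite canvas and join their per-element captions into a single detailed caption. The real CompCap framework drives this with LLMs and rendering tools; the layout and caption template below are assumptions.</p>

<pre><code>from PIL import Image

def compose(elements, captions, tile=256):
    """elements: list of PIL images; captions: matching list of strings."""
    canvas = Image.new("RGB", (tile * len(elements), tile), "white")
    for i, img in enumerate(elements):
        canvas.paste(img.resize((tile, tile)), (i * tile, 0))
    caption = "A composite image showing, left to right: " + "; ".join(captions)
    return canvas, caption

# toy usage with solid-colour stand-ins for charts / posters / screenshots
elems = [Image.new("RGB", (300, 200), c) for c in ("red", "green", "blue")]
caps = ["a red chart", "a green poster", "a blue screenshot"]
img, cap = compose(elems, caps)
print(img.size, "|", cap)
</code></pre>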
            ]]>
      </content:encoded>
      <pubDate>Mon, 09 Dec 2024 20:24:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/810325e4/f607e86d.mp3" length="21102376" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1315</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 11 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He</p>

            <p><strong>Title:</strong><br>
            CompCap: Improving Multimodal Large Language Models with Composite Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.05243v1">http://arxiv.org/abs/2412.05243v1</a></p>

            <p><strong>Abstract:</strong><br>
            How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs' understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisionZip: Longer is Better but Not Necessary in Vision Language Models</title>
      <itunes:episode>169</itunes:episode>
      <podcast:episode>169</podcast:episode>
      <itunes:title>VisionZip: Longer is Better but Not Necessary in Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">51ff973a-c77f-4cd4-b50d-199d40c5a004</guid>
      <link>https://share.transistor.fm/s/381d8fb6</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionZip: Longer is Better but Not Necessary in Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04467v1">http://arxiv.org/abs/2412.04467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionZip: Longer is Better but Not Necessary in Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04467v1">http://arxiv.org/abs/2412.04467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .</p>
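
            <p><strong>Illustrative sketch:</strong><br>
            One simple way to realize "select a set of informative tokens": keep the visual tokens that receive the most [CLS] attention and average the remainder into a few contextual tokens. The selection rule, the keep/merge counts, and the use of [CLS] attention are assumptions, not necessarily VisionZip's exact procedure.</p>

<pre><code>import torch

def prune_visual_tokens(tokens, cls_attn, keep=64, merged=8):
    """
    tokens:   (N, D) visual tokens from the vision encoder
    cls_attn: (N,)   attention the [CLS] token pays to each visual token
    Returns roughly keep + merged tokens instead of N.
    """
    top = torch.topk(cls_attn, k=keep).indices         # dominant tokens
    rest = torch.ones(tokens.size(0), dtype=torch.bool)
    rest[top] = False
    # pool the remaining tokens into `merged` groups to retain context
    groups = tokens[rest].chunk(merged)
    context = torch.stack([g.mean(dim=0) for g in groups])
    return torch.cat([tokens[top], context], dim=0)

toks = torch.randn(576, 1024)                 # e.g. 24x24 CLIP patch tokens
attn = torch.rand(576)
print(prune_visual_tokens(toks, attn).shape)  # torch.Size([72, 1024])
</code></pre>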
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:57:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/381d8fb6/d985dd8a.mp3" length="20989523" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1308</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 83 | cs.CV, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia</p>

            <p><strong>Title:</strong><br>
            VisionZip: Longer is Better but Not Necessary in Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04467v1">http://arxiv.org/abs/2412.04467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly enhances model inference speed, improving prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion</title>
      <itunes:episode>168</itunes:episode>
      <podcast:episode>168</podcast:episode>
      <itunes:title>Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">94fb7121-cd8a-43c0-a357-1af37ac0efa6</guid>
      <link>https://share.transistor.fm/s/5e2a7c99</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao</p>

            <p><strong>Title:</strong><br>
            Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04424v1">http://arxiv.org/abs/2412.04424v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and more easily adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training consists of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced: https://github.com/JiuhaiChen/Florence-VL</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao</p>

            <p><strong>Title:</strong><br>
            Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04424v1">http://arxiv.org/abs/2412.04424v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and more easily adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training consists of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced: https://github.com/JiuhaiChen/Florence-VL</p>
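
            <p><strong>Illustrative sketch:</strong><br>
            A rough picture of what a depth-breadth fusion module has to do: concatenate features taken from several encoder depths and from several prompts, then project them into the LLM embedding space. The dimensions and the plain concatenation-plus-MLP scheme below are assumptions, not the DBFusion design itself.</p>

<pre><code>import torch
import torch.nn as nn

class ToyDBFusion(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=2048, n_depths=3, n_prompts=2):
        super().__init__()
        fused_dim = vis_dim * n_depths * n_prompts
        self.proj = nn.Sequential(
            nn.Linear(fused_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, feats):
        # feats: list over prompts, each a list over depths of (B, N, vis_dim)
        per_prompt = [torch.cat(depths, dim=-1) for depths in feats]  # depth concat
        fused = torch.cat(per_prompt, dim=-1)                         # breadth concat
        return self.proj(fused)                                       # (B, N, llm_dim)

feats = [[torch.randn(1, 196, 768) for _ in range(3)] for _ in range(2)]
print(ToyDBFusion()(feats).shape)   # torch.Size([1, 196, 2048])
</code></pre>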
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:57:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5e2a7c99/c5d40816.mp3" length="18665282" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1163</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 46 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao</p>

            <p><strong>Title:</strong><br>
            Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04424v1">http://arxiv.org/abs/2412.04424v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and more easily adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training consists of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced: https://github.com/JiuhaiChen/Florence-VL</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>NVILA: Efficient Frontier Visual Language Models</title>
      <itunes:episode>167</itunes:episode>
      <podcast:episode>167</podcast:episode>
      <itunes:title>NVILA: Efficient Frontier Visual Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3a29eb96-f076-48fc-9732-e72f48c6327f</guid>
      <link>https://share.transistor.fm/s/00ceb83e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu</p>

            <p><strong>Title:</strong><br>
            NVILA: Efficient Frontier Visual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04468v1">http://arxiv.org/abs/2412.04468v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu</p>

            <p><strong>Title:</strong><br>
            NVILA: Efficient Frontier Visual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04468v1">http://arxiv.org/abs/2412.04468v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.</p>
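
            <p><strong>Illustrative sketch:</strong><br>
            "Scale-then-compress" in miniature: encode at a higher resolution (more patch tokens), then compress by pooling neighbouring patches before handing tokens to the LLM. The 2x2 average pooling below is just one simple compression choice, not NVILA's actual token-compression design.</p>

<pre><code>import torch

def compress_tokens(tokens, grid, factor=2):
    """tokens: (B, grid*grid, D) patch tokens arranged on a square grid."""
    B, N, D = tokens.shape
    x = tokens.view(B, grid, grid, D).permute(0, 3, 1, 2)      # (B, D, g, g)
    x = torch.nn.functional.avg_pool2d(x, kernel_size=factor)  # (B, D, g/f, g/f)
    return x.flatten(2).transpose(1, 2)                        # (B, (g/f)^2, D)

hi_res = torch.randn(1, 48 * 48, 1024)           # scaled-up: 48x48 patches
print(compress_tokens(hi_res, grid=48).shape)    # torch.Size([1, 576, 1024])
</code></pre>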
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:56:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/00ceb83e/488c69b4.mp3" length="18799394" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1171</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 36 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu</p>

            <p><strong>Title:</strong><br>
            NVILA: Efficient Frontier Visual Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04468v1">http://arxiv.org/abs/2412.04468v1</a></p>

            <p><strong>Abstract:</strong><br>
            Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction</title>
      <itunes:episode>166</itunes:episode>
      <podcast:episode>166</podcast:episode>
      <itunes:title>Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6e4d2f6-8418-46f8-bf22-eafaed596923</guid>
      <link>https://share.transistor.fm/s/89dc661e</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04454v1">http://arxiv.org/abs/2412.04454v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure-vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-source all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04454v1">http://arxiv.org/abs/2412.04454v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure-vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-source all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.</p>
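
            <p><strong>Illustrative sketch:</strong><br>
            What a "consistent action space" can look like in practice: a small, platform-agnostic set of actions with normalized screen coordinates that any backend (browser, desktop, mobile) can execute. The action names and fields below are invented for illustration; they are not the Aguvis action schema.</p>

<pre><code>from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", ...
    x: float = 0.0     # normalized screen coordinates in [0, 1]
    y: float = 0.0
    text: str = ""

def execute(action, platform):
    # a real agent would dispatch to a browser/desktop/mobile driver here
    print(f"[{platform}] {action.kind} at ({action.x:.2f}, {action.y:.2f}) {action.text}")

plan = [
    Action("click", 0.12, 0.90),            # focus the search box
    Action("type", 0.12, 0.90, "weather"),
    Action("click", 0.95, 0.90),            # submit
]
for step in plan:
    execute(step, platform="android")
</code></pre>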
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:56:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/89dc661e/6b31a21b.mp3" length="19898226" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1240</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong</p>

            <p><strong>Title:</strong><br>
            Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04454v1">http://arxiv.org/abs/2412.04454v1</a></p>

            <p><strong>Abstract:</strong><br>
            Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-source all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection</title>
      <itunes:episode>165</itunes:episode>
      <podcast:episode>165</podcast:episode>
      <itunes:title>Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">110cb74a-eff0-48f2-892e-84e96f5d1478</guid>
      <link>https://share.transistor.fm/s/f3c67f02</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang</p>

            <p><strong>Title:</strong><br>
            Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04455v1">http://arxiv.org/abs/2412.04455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.</p>
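
            <p>To make the constraint-satisfaction framing concrete, here is a toy Python monitor of our own; the constraint names, thresholds, and state layout are illustrative stand-ins for the code a VLM would generate, not the paper's implementation:</p>

            <pre><code># Toy stand-in for VLM-generated monitor code; constraint names, thresholds,
# and the state layout are our own assumptions, not the paper's interface.
# operator.le(a, b) checks that a is at most b.
import math
import operator

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def grasp_held(state, max_gap=0.03):
    # Reactive constraint: the object must stay close to the gripper.
    return operator.le(distance(state["gripper"], state["object"]), max_gap)

def path_clear(state, margin=0.10):
    # Proactive constraint: flag a foreseeable collision before it happens.
    gaps = [distance(state["end_effector"], obs) for obs in state["obstacles"]]
    return operator.le(margin, min(gaps))

CONSTRAINTS = {"grasp_held": grasp_held, "path_clear": path_clear}

def monitor(state):
    """Return the names of constraints violated in the current frame."""
    return [name for name, check in CONSTRAINTS.items() if not check(state)]

state = {"gripper": (0.0, 0.0), "object": (0.0, 0.02),
         "end_effector": (0.1, 0.1), "obstacles": [(0.5, 0.5)]}
print(monitor(state))  # [] while both constraints hold
</code></pre>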
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang</p>

            <p><strong>Title:</strong><br>
            Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04455v1">http://arxiv.org/abs/2412.04455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:56:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f3c67f02/6eef36d8.mp3" length="22090044" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1377</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang</p>

            <p><strong>Title:</strong><br>
            Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.04455v1">http://arxiv.org/abs/2412.04455v1</a></p>

            <p><strong>Abstract:</strong><br>
            Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Evaluating Language Models as Synthetic Data Generators</title>
      <itunes:episode>164</itunes:episode>
      <podcast:episode>164</podcast:episode>
      <itunes:title>Evaluating Language Models as Synthetic Data Generators</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7570bd4f-1870-4001-a0cb-6c3e60e19dc8</guid>
      <link>https://share.transistor.fm/s/332e92e3</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            Evaluating Language Models as Synthetic Data Generators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03679v1">http://arxiv.org/abs/2412.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.</p>
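
            <p>The core measurement idea, scoring a generator by how well a student trained on its synthetic data performs, can be sketched with a deliberately tiny, runnable Python example of our own; the "generator" and "student" below are toy stand-ins, not AgoraBench's models or metrics:</p>

            <pre><code># Our own toy version of the generate/train/evaluate loop: each generator
# emits synthetic (x, label) pairs, a trivial student is fit on them, and the
# generator is scored by the student's accuracy on a clean test set.
import random
from collections import Counter

def make_generator(noise):
    """Synthetic-data generator for the rule label = x mod 2, with label noise."""
    def generate(n):
        pairs = []
        for _ in range(n):
            x = random.randrange(100)
            label = x % 2
            if random.choices([False, True], weights=[1 - noise, noise])[0]:
                label = 1 - label   # corrupted instance
            pairs.append((x, label))
        return pairs
    return generate

def train_student(pairs):
    """'Student' = majority label per parity bucket, the simplest possible learner."""
    votes = {0: Counter(), 1: Counter()}
    for x, label in pairs:
        votes[x % 2][label] += 1
    return {parity: c.most_common(1)[0][0] for parity, c in votes.items()}

def evaluate(student):
    test = [(x, x % 2) for x in range(100)]
    return sum(student[x % 2] == label for x, label in test) / len(test)

random.seed(0)
for name, noise in [("careful-generator", 0.05), ("careless-generator", 0.60)]:
    student = train_student(make_generator(noise)(500))
    print(name, evaluate(student))   # the cleaner generator produces the stronger student
</code></pre>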
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            Evaluating Language Models as Synthetic Data Generators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03679v1">http://arxiv.org/abs/2412.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:55:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/332e92e3/1c187e8d.mp3" length="20254317" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1262</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig</p>

            <p><strong>Title:</strong><br>
            Evaluating Language Models as Synthetic Data Generators</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03679v1">http://arxiv.org/abs/2412.03679v1</a></p>

            <p><strong>Abstract:</strong><br>
            Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Noise is Worth Diffusion Guidance</title>
      <itunes:episode>163</itunes:episode>
      <podcast:episode>163</podcast:episode>
      <itunes:title>A Noise is Worth Diffusion Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4987f6cc-e969-4ce5-8f59-d843c62fc755</guid>
      <link>https://share.transistor.fm/s/a52bfc5f</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            A Noise is Worth Diffusion Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03895v1">http://arxiv.org/abs/2412.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.</p>
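
            <p>The contrast between per-step classifier-free guidance and a single up-front noise refinement can be sketched schematically in Python; the denoiser and the refining map below are toy stand-ins of our own, not the paper's models:</p>

            <pre><code># Our own schematic contrast: CFG makes two model calls per step, while the
# refined-noise route maps the initial noise once and then denoises plainly.
import numpy as np

rng = np.random.default_rng(0)
steps = 4

def denoise_step(x, cond):
    # Stand-in denoiser: pulls the sample toward a condition-dependent target.
    target = np.full_like(x, fill_value=float(cond))
    return x + 0.5 * (target - x)

def sample_with_cfg(x, cond, scale=3.0):
    for _ in range(steps):
        x_cond = denoise_step(x, cond)           # conditional branch
        x_uncond = denoise_step(x, 0.0)          # unconditional branch
        x = x_uncond + scale * (x_cond - x_uncond)   # two model calls per step
    return x

def sample_guidance_free(x, cond, refine):
    x = refine(x)                                # one-off mapping to "guidance-free noise"
    for _ in range(steps):
        x = denoise_step(x, cond)                # single model call per step afterwards
    return x

x0 = rng.standard_normal(8)
refine = lambda z: 0.9 * z                       # stand-in for the learned noise-refining model
print(sample_with_cfg(x0.copy(), cond=1.0)[:3])
print(sample_guidance_free(x0.copy(), cond=1.0, refine=refine)[:3])
</code></pre>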
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            A Noise is Worth Diffusion Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03895v1">http://arxiv.org/abs/2412.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:55:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a52bfc5f/a3b40e02.mp3" length="20537255" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1280</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 25 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, Seungryong Kim</p>

            <p><strong>Title:</strong><br>
            A Noise is Worth Diffusion Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03895v1">http://arxiv.org/abs/2412.03895v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods, such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to 'guidance-free noise', we uncover that small low-magnitude low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory. Expanding on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Structured 3D Latents for Scalable and Versatile 3D Generation</title>
      <itunes:episode>162</itunes:episode>
      <podcast:episode>162</podcast:episode>
      <itunes:title>Structured 3D Latents for Scalable and Versatile 3D Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8478c8c3-6c6d-4e94-ac1a-0f942f4adaeb</guid>
      <link>https://share.transistor.fm/s/096ee917</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang</p>

            <p><strong>Title:</strong><br>
            Structured 3D Latents for Scalable and Versatile 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01506v1">http://arxiv.org/abs/2412.01506v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.</p>
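
            <p>For intuition, a structured latent of this kind can be pictured as a sparse set of occupied grid cells, each carrying a feature vector, that different decoders then consume; the layout below is our own assumed sketch, not the released code:</p>

            <pre><code># Our own minimal data-structure sketch of a "structured latent": occupied
# cells of a sparse 3D grid with per-cell features that downstream decoders
# can turn into different output formats.
import numpy as np

class StructuredLatent:
    def __init__(self, resolution, coords, feats):
        self.resolution = resolution          # e.g. 64 means a 64^3 grid
        self.coords = np.asarray(coords)      # (N, 3) integer cells that are occupied
        self.feats = np.asarray(feats)        # (N, C) per-cell latent features

    def to_point_cloud(self):
        # Cheapest "decoder": cell centers in the unit cube, ignoring the features.
        return (self.coords + 0.5) / self.resolution

    def to_dense(self):
        # Scatter features into a dense grid, e.g. as input to a voxel decoder.
        grid = np.zeros((self.resolution,) * 3 + (self.feats.shape[1],))
        grid[tuple(self.coords.T)] = self.feats
        return grid

rng = np.random.default_rng(0)
coords = rng.integers(0, 64, size=(128, 3))
slat = StructuredLatent(64, coords, rng.standard_normal((128, 16)))
print(slat.to_point_cloud().shape, slat.to_dense().shape)
</code></pre>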
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang</p>

            <p><strong>Title:</strong><br>
            Structured 3D Latents for Scalable and Versatile 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01506v1">http://arxiv.org/abs/2412.01506v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:55:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/096ee917/ee4ddd92.mp3" length="22732404" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1417</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 22 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang</p>

            <p><strong>Title:</strong><br>
            Structured 3D Latents for Scalable and Versatile 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01506v1">http://arxiv.org/abs/2412.01506v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Negative Token Merging: Image-based Adversarial Feature Guidance</title>
      <itunes:episode>161</itunes:episode>
      <podcast:episode>161</podcast:episode>
      <itunes:title>Negative Token Merging: Image-based Adversarial Feature Guidance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">09e54431-1e26-4098-a3c3-15673da53ffc</guid>
      <link>https://share.transistor.fm/s/b9dfa614</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.GR, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            Negative Token Merging: Image-based Adversarial Feature Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01339v2">http://arxiv.org/abs/2412.01339v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts or avoid specific visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. We introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance through images by selectively pushing apart matching visual features between reference and generated images during the reverse diffusion process. By simply adjusting the used reference, NegToMe enables a diverse range of applications. Notably, when using other images in the same batch as reference, we find that NegToMe significantly enhances output diversity (e.g., racial, gender, visual) by guiding features of each image away from others. Similarly, when used w.r.t. copyrighted reference images, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement with just a few lines of code, incurs only marginally higher (&lt;4%) inference time, and is compatible with different diffusion architectures, including those like Flux, which don't natively support the use of a negative prompt. Code is available at https://negtome.github.io</p>
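
            <p>The abstract notes the method takes only a few lines; here is our own minimal sketch of the core mechanism on plain vectors, matching each generated token to its closest reference token and pushing it away. It illustrates the idea only, not the authors' implementation inside a diffusion transformer:</p>

            <pre><code># Our own minimal sketch of negative token merging on plain feature vectors;
# an illustration of the mechanism, not the released implementation.
import numpy as np

def neg_tome_step(gen_tokens, ref_tokens, alpha=0.1):
    """Match each generated token to its most similar reference token
    (cosine similarity) and push it a small step away from that match."""
    g = gen_tokens / np.linalg.norm(gen_tokens, axis=1, keepdims=True)
    r = ref_tokens / np.linalg.norm(ref_tokens, axis=1, keepdims=True)
    match = (g @ r.T).argmax(axis=1)               # best-matching reference token
    return gen_tokens + alpha * (gen_tokens - ref_tokens[match])

rng = np.random.default_rng(0)
gen = rng.standard_normal((4, 8))                  # generated-image tokens
ref = rng.standard_normal((6, 8))                  # reference-image tokens
pushed = neg_tome_step(gen, ref)
print(np.round(pushed - gen, 3))                   # each token nudged away from its match
</code></pre>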
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.GR, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            Negative Token Merging: Image-based Adversarial Feature Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01339v2">http://arxiv.org/abs/2412.01339v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts or avoid specific visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. We introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance through images by selectively pushing apart matching visual features between reference and generated images during the reverse diffusion process. By simply adjusting the used reference, NegToMe enables a diverse range of applications. Notably, when using other images in the same batch as reference, we find that NegToMe significantly enhances output diversity (e.g., racial, gender, visual) by guiding features of each image away from others. Similarly, when used w.r.t. copyrighted reference images, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement with just a few lines of code, incurs only marginally higher (&lt;4%) inference time, and is compatible with different diffusion architectures, including those like Flux, which don't natively support the use of a negative prompt. Code is available at https://negtome.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:54:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9dfa614/3c414d9a.mp3" length="18832847" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1173</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 21 | cs.CV, cs.AI, cs.GR, cs.LG, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Jaskirat Singh, Lindsey Li, Weijia Shi, Ranjay Krishna, Yejin Choi, Pang Wei Koh, Michael F. Cohen, Stephen Gould, Liang Zheng, Luke Zettlemoyer</p>

            <p><strong>Title:</strong><br>
            Negative Token Merging: Image-based Adversarial Feature Guidance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.01339v2">http://arxiv.org/abs/2412.01339v2</a></p>

            <p><strong>Abstract:</strong><br>
            Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts or avoid specific visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. We introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance through images by selectively pushing apart matching visual features between reference and generated images during the reverse diffusion process. By simply adjusting the used reference, NegToMe enables a diverse range of applications. Notably, when using other images in the same batch as reference, we find that NegToMe significantly enhances output diversity (e.g., racial, gender, visual) by guiding features of each image away from others. Similarly, when used w.r.t. copyrighted reference images, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement with just a few lines of code, incurs only marginally higher (&lt;4%) inference time, and is compatible with different diffusion architectures, including those like Flux, which don't natively support the use of a negative prompt. Code is available at https://negtome.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MV-Adapter: Multi-view Consistent Image Generation Made Easy</title>
      <itunes:episode>160</itunes:episode>
      <podcast:episode>160</podcast:episode>
      <itunes:title>MV-Adapter: Multi-view Consistent Image Generation Made Easy</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a48d5e79-c451-4615-86c8-076c1a0231d0</guid>
      <link>https://share.transistor.fm/s/3b17c551</link>
      <description>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            MV-Adapter: Multi-view Consistent Image Generation Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03632v1">http://arxiv.org/abs/2412.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.</p>
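
            <p>The adapter idea, a trainable duplicate of a frozen self-attention layer running in parallel and added residually, can be sketched in PyTorch; the shapes and wiring below are our own assumptions for illustration, not the released architecture:</p>

            <pre><code># Our own conceptual sketch: a duplicated, trainable copy of a frozen
# self-attention layer attends across views and is added residually, so the
# base layer and its feature space stay untouched.
import copy
import torch
import torch.nn as nn

class ParallelMVAttention(nn.Module):
    def __init__(self, base_attn):
        super().__init__()
        self.base_attn = base_attn                      # frozen, from the T2I model
        self.mv_attn = copy.deepcopy(base_attn)         # duplicated, trainable copy
        for p in self.base_attn.parameters():
            p.requires_grad_(False)

    def forward(self, tokens):
        # tokens: (views, seq, dim); the base path attends within each view,
        # the duplicated path attends across all views jointly.
        views, seq, dim = tokens.shape
        base_out, _ = self.base_attn(tokens, tokens, tokens)
        flat = tokens.reshape(1, views * seq, dim)
        mv_out, _ = self.mv_attn(flat, flat, flat)
        return base_out + mv_out.reshape(views, seq, dim)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
layer = ParallelMVAttention(attn)
out = layer(torch.randn(4, 16, 64))                     # 4 views, 16 tokens each
print(out.shape)
</code></pre>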
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            MV-Adapter: Multi-view Consistent Image Generation Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03632v1">http://arxiv.org/abs/2412.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 08 Dec 2024 14:54:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3b17c551/e8bb32d4.mp3" length="20581166" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Upvotes: 17 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng</p>

            <p><strong>Title:</strong><br>
            MV-Adapter: Multi-view Consistent Image Generation Made Easy</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2412.03632v1">http://arxiv.org/abs/2412.03632v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ShowUI: One Vision-Language-Action Model for GUI Visual Agent</title>
      <itunes:episode>159</itunes:episode>
      <podcast:episode>159</podcast:episode>
      <itunes:title>ShowUI: One Vision-Language-Action Model for GUI Visual Agent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ac5ea3b8-b5ac-4327-8dea-67e34328d1c1</guid>
      <link>https://share.transistor.fm/s/9678084f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 48 | cs.CV, cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI: One Vision-Language-Action Model for GUI Visual Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17465v1">http://arxiv.org/abs/2411.17465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph and adaptively identifying their redundant relationships to serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or the pairing of multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and yields a 1.4x speedup. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.</p>
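
            <p>UI-guided token selection can be pictured as grouping visually redundant screenshot patches and keeping one token per group; the union-find sketch below is our own illustration on toy scalar patches, not the released code:</p>

            <pre><code># Our own illustrative sketch: adjacent screenshot patches with nearly
# identical content are grouped into connected components, and only one
# token per component is kept.
import math
import numpy as np

def select_tokens(patches, tol=1e-3):
    """patches: (H, W) grid of patch descriptors (scalar mean colors here).
    Returns the flat indices of the single representative kept per group."""
    h, w = patches.shape
    parent = list(range(h * w))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for r in range(h):
        for c in range(w):
            idx = r * w + c
            if c + 1 != w and math.isclose(patches[r, c], patches[r, c + 1], abs_tol=tol):
                union(idx, idx + 1)           # merge with the right neighbour
            if r + 1 != h and math.isclose(patches[r, c], patches[r + 1, c], abs_tol=tol):
                union(idx, idx + w)           # merge with the bottom neighbour
    return sorted({find(i) for i in range(h * w)})

screen = np.zeros((4, 6))                     # mostly uniform background...
screen[1, 2] = 0.8                            # ...plus two distinct UI elements
screen[3, 5] = 0.5
kept = select_tokens(screen)
print(len(kept), "of", screen.size, "tokens kept")   # far fewer tokens than patches
</code></pre>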
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 48 | cs.CV, cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI: One Vision-Language-Action Model for GUI Visual Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17465v1">http://arxiv.org/abs/2411.17465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph and adaptively identifying their redundant relationships to serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or the pairing of multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and yields a 1.4x speedup. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:51:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9678084f/d0dc373e.mp3" length="23632687" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1473</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 48 | cs.CV, cs.AI, cs.CL, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou</p>

            <p><strong>Title:</strong><br>
            ShowUI: One Vision-Language-Action Model for GUI Visual Agent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17465v1">http://arxiv.org/abs/2411.17465v1</a></p>

            <p><strong>Abstract:</strong><br>
            Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph and adaptively identifying their redundant relationships to serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or the pairing of multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and yields a 1.4x speedup. Navigation experiments across web Mind2Web, mobile AITW, and online MiniWob environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Star Attention: Efficient LLM Inference over Long Sequences</title>
      <itunes:episode>158</itunes:episode>
      <podcast:episode>158</podcast:episode>
      <itunes:title>Star Attention: Efficient LLM Inference over Long Sequences</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">90617b85-9e83-4bad-9f14-49bef7dad2e5</guid>
      <link>https://share.transistor.fm/s/9bbfd181</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shantanu Acharya, Fei Jia, Boris Ginsburg</p>

            <p><strong>Title:</strong><br>
            Star Attention: Efficient LLM Inference over Long Sequences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17116v1">http://arxiv.org/abs/2411.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.</p>
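
            <p>The two phases can be pictured in a few lines of numpy: context blocks are encoded with purely local attention, and only the query tokens attend globally over the cached blocks. This serial toy of ours omits the multi-host sharding and online-softmax aggregation of the real method:</p>

            <pre><code># Our own serial numpy toy of the two phases; real Star Attention shards the
# blocks across hosts and aggregates attention statistics online.
import numpy as np

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
dim, block = 16, 32
context = rng.standard_normal((4 * block, dim))   # long context split into 4 blocks
query = rng.standard_normal((8, dim))             # query/response tokens

# Phase 1: each block is encoded with blockwise-local attention (one host per
# block in the real system); no cross-block communication is needed here.
cached = [attention(blk, blk, blk) for blk in np.split(context, 4)]

# Phase 2: query tokens attend to all cached context with global attention.
kv = np.concatenate(cached)
out = attention(query, kv, kv)
print(out.shape)                                  # (8, 16): one output per query token
</code></pre>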
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shantanu Acharya, Fei Jia, Boris Ginsburg</p>

            <p><strong>Title:</strong><br>
            Star Attention: Efficient LLM Inference over Long Sequences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17116v1">http://arxiv.org/abs/2411.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:51:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bbfd181/dc618c41.mp3" length="19808777" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1234</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Shantanu Acharya, Fei Jia, Boris Ginsburg</p>

            <p><strong>Title:</strong><br>
            Star Attention: Efficient LLM Inference over Long Sequences</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17116v1">http://arxiv.org/abs/2411.17116v1</a></p>

            <p><strong>Abstract:</strong><br>
            Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Pathways on the Image Manifold: Image Editing via Video Generation</title>
      <itunes:episode>157</itunes:episode>
      <podcast:episode>157</podcast:episode>
      <itunes:title>Pathways on the Image Manifold: Image Editing via Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64ab78ed-ad3e-47a4-8f9c-d0c086b414e0</guid>
      <link>https://share.transistor.fm/s/2f8f4f2b</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            Pathways on the Image Manifold: Image Editing via Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16819v1">http://arxiv.org/abs/2411.16819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            Pathways on the Image Manifold: Image Editing via Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16819v1">http://arxiv.org/abs/2411.16819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:51:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2f8f4f2b/79f41f3f.mp3" length="24116689" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1504</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Noam Rotstein, Gal Yona, Daniel Silver, Roy Velich, David Bensaïd, Ron Kimmel</p>

            <p><strong>Title:</strong><br>
            Pathways on the Image Manifold: Image Editing via Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16819v1">http://arxiv.org/abs/2411.16819v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in image editing, driven by image diffusion models, have shown remarkable progress. However, significant challenges remain, as these models often struggle to follow complex edit instructions accurately and frequently compromise fidelity by altering key elements of the original image. Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. We reformulate image editing as a temporal process, using pretrained video models to create smooth transitions from the original image to the desired edit. This approach traverses the image manifold continuously, ensuring consistent edits while preserving the original image's key aspects. Our approach achieves state-of-the-art results on text-based image editing, demonstrating significant improvements in both edit accuracy and image preservation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs</title>
      <itunes:episode>156</itunes:episode>
      <podcast:episode>156</podcast:episode>
      <itunes:title>MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e3f2e8f6-d661-4f0a-b8c0-71f3a75346b5</guid>
      <link>https://share.transistor.fm/s/1732ec9e</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15296v1">http://arxiv.org/abs/2411.15296v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmark types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15296v1">http://arxiv.org/abs/2411.15296v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmark types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:50:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1732ec9e/ab3acf30.mp3" length="25485924" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1589</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He</p>

            <p><strong>Title:</strong><br>
            MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15296v1">http://arxiv.org/abs/2411.15296v1</a></p>

            <p><strong>Abstract:</strong><br>
            As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarised benchmark types divided by the evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration</title>
      <itunes:episode>155</itunes:episode>
      <podcast:episode>155</podcast:episode>
      <itunes:title>Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">573ad731-3fda-41e3-846f-1e1fdd41988c</guid>
      <link>https://share.transistor.fm/s/6405f665</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang</p>

            <p><strong>Title:</strong><br>
            Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17686v1">http://arxiv.org/abs/2411.17686v1</a></p>

            <p><strong>Abstract:</strong><br>
            To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified "filter-correlate-compress" paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang</p>

            <p><strong>Title:</strong><br>
            Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17686v1">http://arxiv.org/abs/2411.17686v1</a></p>

            <p><strong>Abstract:</strong><br>
            To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified "filter-correlate-compress" paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:50:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6405f665/0b877c7e.mp3" length="21387443" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang</p>

            <p><strong>Title:</strong><br>
            Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17686v1">http://arxiv.org/abs/2411.17686v1</a></p>

            <p><strong>Abstract:</strong><br>
            To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified "filter-correlate-compress" paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SketchAgent: Language-Driven Sequential Sketch Generation</title>
      <itunes:episode>154</itunes:episode>
      <podcast:episode>154</podcast:episode>
      <itunes:title>SketchAgent: Language-Driven Sequential Sketch Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b78f52fc-8e70-4a37-be7a-0df1a0af1ca8</guid>
      <link>https://share.transistor.fm/s/06c7fee3</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba</p>

            <p><strong>Title:</strong><br>
            SketchAgent: Language-Driven Sequential Sketch Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17673v1">http://arxiv.org/abs/2411.17673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba</p>

            <p><strong>Title:</strong><br>
            SketchAgent: Language-Driven Sequential Sketch Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17673v1">http://arxiv.org/abs/2411.17673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:50:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/06c7fee3/2bfc1242.mp3" length="23755981" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1481</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yael Vinker, Tamar Rott Shaham, Kristine Zheng, Alex Zhao, Judith E Fan, Antonio Torralba</p>

            <p><strong>Title:</strong><br>
            SketchAgent: Language-Driven Sequential Sketch Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17673v1">http://arxiv.org/abs/2411.17673v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to "draw" using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TEXGen: a Generative Diffusion Model for Mesh Textures</title>
      <itunes:episode>153</itunes:episode>
      <podcast:episode>153</podcast:episode>
      <itunes:title>TEXGen: a Generative Diffusion Model for Mesh Textures</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">31be5640-84c0-40ab-a138-61d9d898f9b1</guid>
      <link>https://share.transistor.fm/s/ddc094d1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            TEXGen: a Generative Diffusion Model for Mesh Textures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14740v1">http://arxiv.org/abs/2411.14740v1</a></p>

            <p><strong>Abstract:</strong><br>
            While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at http://cvmi-lab.github.io/TEXGen/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            TEXGen: a Generative Diffusion Model for Mesh Textures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14740v1">http://arxiv.org/abs/2411.14740v1</a></p>

            <p><strong>Abstract:</strong><br>
            While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at http://cvmi-lab.github.io/TEXGen/.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:49:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ddc094d1/0a191878.mp3" length="23503113" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1465</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, JianHui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, Xiaojuan Qi</p>

            <p><strong>Title:</strong><br>
            TEXGen: a Generative Diffusion Model for Mesh Textures</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14740v1">http://arxiv.org/abs/2411.14740v1</a></p>

            <p><strong>Abstract:</strong><br>
            While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at http://cvmi-lab.github.io/TEXGen/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models</title>
      <itunes:episode>152</itunes:episode>
      <podcast:episode>152</podcast:episode>
      <itunes:title>VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">de0194e0-1ca3-4418-9171-c351240cb81a</guid>
      <link>https://share.transistor.fm/s/c85f55b0</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu</p>

            <p><strong>Title:</strong><br>
            VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17451v1">http://arxiv.org/abs/2411.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r &gt; 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights, will become a valuable resource for advancing VL-GenRMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu</p>

            <p><strong>Title:</strong><br>
            VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17451v1">http://arxiv.org/abs/2411.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r &gt; 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights, will become a valuable resource for advancing VL-GenRMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:49:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c85f55b0/d1528794.mp3" length="21200605" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu</p>

            <p><strong>Title:</strong><br>
            VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17451v1">http://arxiv.org/abs/2411.17451v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r &gt; 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights, will become a valuable resource for advancing VL-GenRMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning 3D Representations from Procedural 3D Programs</title>
      <itunes:episode>151</itunes:episode>
      <podcast:episode>151</podcast:episode>
      <itunes:title>Learning 3D Representations from Procedural 3D Programs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a252f204-a4b6-48f0-bf6c-372c479ee3b7</guid>
      <link>https://share.transistor.fm/s/0ca69304</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuweiyi Chen, Zezhou Cheng</p>

            <p><strong>Title:</strong><br>
            Learning 3D Representations from Procedural 3D Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17467v1">http://arxiv.org/abs/2411.17467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations.   Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuweiyi Chen, Zezhou Cheng</p>

            <p><strong>Title:</strong><br>
            Learning 3D Representations from Procedural 3D Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17467v1">http://arxiv.org/abs/2411.17467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations.   Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:49:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0ca69304/076c3d9e.mp3" length="23470931" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1463</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Xuweiyi Chen, Zezhou Cheng</p>

            <p><strong>Title:</strong><br>
            Learning 3D Representations from Procedural 3D Programs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.17467v1">http://arxiv.org/abs/2411.17467v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations.   Remarkably, despite lacking semantic content, the 3D representations learned from this synthesized dataset perform on par with state-of-the-art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, including shape classification, part segmentation, and masked point cloud completion. Our analysis further suggests that current self-supervised learning methods primarily capture geometric structures rather than high-level semantics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE</title>
      <itunes:episode>150</itunes:episode>
      <podcast:episode>150</podcast:episode>
      <itunes:title>SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ad10744-f42f-4e73-aa9c-448d2cab469a</guid>
      <link>https://share.transistor.fm/s/e4fd971c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16856v1">http://arxiv.org/abs/2411.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16856v1">http://arxiv.org/abs/2411.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 27 Nov 2024 19:48:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e4fd971c/a8ed701d.mp3" length="25033292" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1561</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan</p>

            <p><strong>Title:</strong><br>
            SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16856v1">http://arxiv.org/abs/2411.16856v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Material Anything: Generating Materials for Any 3D Object via Diffusion</title>
      <itunes:episode>149</itunes:episode>
      <podcast:episode>149</podcast:episode>
      <itunes:title>Material Anything: Generating Materials for Any 3D Object via Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1eb46d31-56fc-4483-ba71-eeba4dcb397f</guid>
      <link>https://share.transistor.fm/s/bfd2ea11</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang</p>

            <p><strong>Title:</strong><br>
            Material Anything: Generating Materials for Any 3D Object via Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15138v1">http://arxiv.org/abs/2411.15138v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang</p>

            <p><strong>Title:</strong><br>
            Material Anything: Generating Materials for Any 3D Object via Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15138v1">http://arxiv.org/abs/2411.15138v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:51:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bfd2ea11/4ef229b3.mp3" length="21044276" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1312</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang</p>

            <p><strong>Title:</strong><br>
            Material Anything: Generating Materials for Any 3D Object via Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15138v1">http://arxiv.org/abs/2411.15138v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head architecture and rendering loss to improve stability and material quality. Additionally, we introduce confidence masks as a dynamic switcher within the diffusion model, enabling it to effectively handle both textured and texture-less objects across varying lighting conditions. By employing a progressive material generation strategy guided by these confidence masks, along with a UV-space material refiner, our method ensures consistent, UV-ready material outputs. Extensive experiments demonstrate our approach outperforms existing methods across a wide range of object categories and lighting conditions.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator</title>
      <itunes:episode>148</itunes:episode>
      <podcast:episode>148</podcast:episode>
      <itunes:title>Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">28dfa422-84f3-4603-a7d6-776708985038</guid>
      <link>https://share.transistor.fm/s/98b8d22d</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15466v1">http://arxiv.org/abs/2411.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven text-to-image generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15466v1">http://arxiv.org/abs/2411.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven text-to-image generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:50:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/98b8d22d/5123a714.mp3" length="26328556" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1642</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15466v1">http://arxiv.org/abs/2411.15466v1</a></p>

            <p><strong>Abstract:</strong><br>
            Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven text-to-image generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge</title>
      <itunes:episode>147</itunes:episode>
      <podcast:episode>147</podcast:episode>
      <itunes:title>From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9d882485-bbd3-499f-8948-c776c03aba0b</guid>
      <link>https://share.transistor.fm/s/16be9f05</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu</p>

            <p><strong>Title:</strong><br>
            From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16594v1">http://arxiv.org/abs/2411.16594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu</p>

            <p><strong>Title:</strong><br>
            From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16594v1">http://arxiv.org/abs/2411.16594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:50:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16be9f05/329b0f21.mp3" length="21125364" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1317</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu</p>

            <p><strong>Title:</strong><br>
            From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16594v1">http://arxiv.org/abs/2411.16594v1</a></p>

            <p><strong>Abstract:</strong><br>
            Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</title>
      <itunes:episode>146</itunes:episode>
      <podcast:episode>146</podcast:episode>
      <itunes:title>O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">95c431b3-fa0a-4669-ad80-aef97ac5bbeb</guid>
      <link>https://share.transistor.fm/s/ef1e06c6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16489v1">http://arxiv.org/abs/2411.16489v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on just tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16489v1">http://arxiv.org/abs/2411.16489v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on just tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:50:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ef1e06c6/31e5e075.mp3" length="20227628" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1261</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, Pengfei Liu</p>

            <p><strong>Title:</strong><br>
            O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16489v1">http://arxiv.org/abs/2411.16489v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with particular focus on the widespread but often undisclosed use of knowledge distillation techniques. While our previous work explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1's API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks. Through extensive experiments, we show that a base model fine-tuned on just tens of thousands of samples of O1-distilled long-thought chains outperforms O1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity. Moreover, our investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks: hallucination, safety and open-domain QA. Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning. We deliberately make this finding public to promote transparency in AI research and to challenge the current trend of obscured technical claims in the field. Our work includes: (1) A detailed technical exposition of the distillation process and its effectiveness, (2) A comprehensive benchmark framework for evaluating and categorizing O1 replication attempts based on their technical transparency and reproducibility, (3) A critical discussion of the limitations and potential risks of over-relying on distillation approaches. Our analysis culminates in a crucial bitter lesson: while the pursuit of more capable AI systems is important, the development of researchers grounded in first-principles thinking is paramount.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MH-MoE: Multi-Head Mixture-of-Experts</title>
      <itunes:episode>145</itunes:episode>
      <podcast:episode>145</podcast:episode>
      <itunes:title>MH-MoE: Multi-Head Mixture-of-Experts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">719fd516-4cc3-466f-a872-d0d47c7fb254</guid>
      <link>https://share.transistor.fm/s/11f937b1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            MH-MoE: Multi-Head Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16205v2">http://arxiv.org/abs/2411.16205v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            MH-MoE: Multi-Head Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16205v2">http://arxiv.org/abs/2411.16205v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:49:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/11f937b1/53a4a398.mp3" length="20220444" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1260</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 17 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shaohan Huang, Xun Wu, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            MH-MoE: Multi-Head Mixture-of-Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16205v2">http://arxiv.org/abs/2411.16205v2</a></p>

            <p><strong>Abstract:</strong><br>
            Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GMAI-VL &amp; GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI</title>
      <itunes:episode>144</itunes:episode>
      <podcast:episode>144</podcast:episode>
      <itunes:title>GMAI-VL &amp; GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d567247c-a350-4918-921c-c5e7117671ae</guid>
      <link>https://share.transistor.fm/s/b06d727f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            GMAI-VL &amp; GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14522v1">http://arxiv.org/abs/2411.14522v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in general artificial intelligence models such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressive three-stage training strategy. This approach significantly enhances the model's performance by integrating visual and textual information, thereby improving its ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            GMAI-VL &amp; GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14522v1">http://arxiv.org/abs/2411.14522v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in general artificial intelligence models such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressive three-stage training strategy. This approach significantly enhances the model's performance by integrating visual and textual information, thereby improving its ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:49:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b06d727f/2222d539.mp3" length="20366394" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1269</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He</p>

            <p><strong>Title:</strong><br>
            GMAI-VL &amp; GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14522v1">http://arxiv.org/abs/2411.14522v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite significant advancements in general artificial intelligence models such as GPT-4, their effectiveness in the medical domain (general medical AI, GMAI) remains constrained due to the absence of specialized medical knowledge. To address this challenge, we present GMAI-VL-5.5M, a comprehensive multimodal medical dataset created by converting hundreds of specialized medical datasets into meticulously constructed image-text pairs. This dataset features comprehensive task coverage, diverse modalities, and high-quality image-text data. Building upon this multimodal dataset, we propose GMAI-VL, a general medical vision-language model with a progressive three-stage training strategy. This approach significantly enhances the model's performance by integrating visual and textual information, thereby improving its ability to process multimodal data and support accurate diagnosis and clinical decision-making. Experimental evaluations demonstrate that GMAI-VL achieves state-of-the-art results across a wide range of multimodal medical tasks, such as visual question answering and medical image diagnosis. Our contributions include the development of the GMAI-VL-5.5M dataset, the introduction of the GMAI-VL model, and the establishment of new benchmarks in multiple medical domains. Code and dataset will be released at https://github.com/uni-medical/GMAI-VL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation</title>
      <itunes:episode>143</itunes:episode>
      <podcast:episode>143</podcast:episode>
      <itunes:title>DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6132c317-ac3e-4f24-a875-62e86b73f679</guid>
      <link>https://share.transistor.fm/s/29e221c1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal</p>

            <p><strong>Title:</strong><br>
            DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16657v1">http://arxiv.org/abs/2411.16657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal</p>

            <p><strong>Title:</strong><br>
            DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16657v1">http://arxiv.org/abs/2411.16657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:49:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/29e221c1/8cfb7a2c.mp3" length="21848456" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1362</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal</p>

            <p><strong>Title:</strong><br>
            DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16657v1">http://arxiv.org/abs/2411.16657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Storytelling video generation (SVG) has recently emerged as a task to create long, multi-motion, multi-scene videos that consistently represent the story described in the input text script. SVG holds great potential for diverse content creation in media and entertainment; however, it also presents significant challenges: (1) objects must exhibit a range of fine-grained, complex motions, (2) multiple objects need to appear consistently across scenes, and (3) subjects may require multiple motions with seamless transitions within a single scene. To address these challenges, we propose DreamRunner, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout and motion planning. Next, DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Knowledge Transfer Across Modalities with Natural Language Supervision</title>
      <itunes:episode>142</itunes:episode>
      <podcast:episode>142</podcast:episode>
      <itunes:title>Knowledge Transfer Across Modalities with Natural Language Supervision</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">19aeb575-daa6-4188-9121-ffdca05b1327</guid>
      <link>https://share.transistor.fm/s/522b2237</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, 68T45 (Primary) 68T50 (Secondary), I.2.6</p>

            <p><strong>Authors:</strong><br>
            Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto</p>

            <p><strong>Title:</strong><br>
            Knowledge Transfer Across Modalities with Natural Language Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15611v1">http://arxiv.org/abs/2411.15611v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, 68T45 (Primary) 68T50 (Secondary), I.2.6</p>

            <p><strong>Authors:</strong><br>
            Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto</p>

            <p><strong>Title:</strong><br>
            Knowledge Transfer Across Modalities with Natural Language Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15611v1">http://arxiv.org/abs/2411.15611v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:48:41 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/522b2237/3c1bbd98.mp3" length="19862286" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1238</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, 68T45 (Primary) 68T50 (Secondary), I.2.6</p>

            <p><strong>Authors:</strong><br>
            Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto</p>

            <p><strong>Title:</strong><br>
            Knowledge Transfer Across Modalities with Natural Language Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15611v1">http://arxiv.org/abs/2411.15611v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a way to learn novel concepts by only using their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that in a pre-trained visual encoder there are enough low-level features already learned (e.g. shape, appearance, color) that can be used to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts in multimodal models, in a very efficient manner, by only requiring a single description of the target concept. Our approach is compatible with both separate textual and visual encoders (e.g. CLIP) and shared parameters across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known by the model. Leveraging Knowledge Transfer we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>One Diffusion to Generate Them All</title>
      <itunes:episode>141</itunes:episode>
      <podcast:episode>141</podcast:episode>
      <itunes:title>One Diffusion to Generate Them All</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cf2f7617-0458-4fe8-9358-cce3b83f6fef</guid>
      <link>https://share.transistor.fm/s/16935ff1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu</p>

            <p><strong>Title:</strong><br>
            One Diffusion to Generate Them All</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16318v1">http://arxiv.org/abs/2411.16318v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu</p>

            <p><strong>Title:</strong><br>
            One Diffusion to Generate Them All</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16318v1">http://arxiv.org/abs/2411.16318v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:48:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/16935ff1/ba87493e.mp3" length="22533428" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1405</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu</p>

            <p><strong>Title:</strong><br>
            One Diffusion to Generate Them All</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16318v1">http://arxiv.org/abs/2411.16318v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction, such as text-to-image, multiview generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset. Our code and checkpoint are freely available at https://github.com/lehduong/OneDiffusion</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VisualLens: Personalization through Visual History</title>
      <itunes:episode>140</itunes:episode>
      <podcast:episode>140</podcast:episode>
      <itunes:title>VisualLens: Personalization through Visual History</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85504dee-7ce4-465b-a27c-4e63a835dee8</guid>
      <link>https://share.transistor.fm/s/bd45e292</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            VisualLens: Personalization through Visual History</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16034v1">http://arxiv.org/abs/2411.16034v1</a></p>

            <p><strong>Abstract:</strong><br>
            We hypothesize that a user's visual history, with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieve this goal, the foremost is the diversity and noise in the visual history, which contains images not necessarily related to a recommendation task, not necessarily reflecting the user's interests, or not preference-relevant at all. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            VisualLens: Personalization through Visual History</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16034v1">http://arxiv.org/abs/2411.16034v1</a></p>

            <p><strong>Abstract:</strong><br>
            We hypothesize that a user's visual history, with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieve this goal, the foremost is the diversity and noise in the visual history, which contains images not necessarily related to a recommendation task, not necessarily reflecting the user's interest, or not even preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 26 Nov 2024 19:47:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bd45e292/a685de48.mp3" length="23812817" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1485</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wang Bill Zhu, Deqing Fu, Kai Sun, Yi Lu, Zhaojiang Lin, Seungwhan Moon, Kanika Narang, Mustafa Canim, Yue Liu, Anuj Kumar, Xin Luna Dong</p>

            <p><strong>Title:</strong><br>
            VisualLens: Personalization through Visual History</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.16034v1">http://arxiv.org/abs/2411.16034v1</a></p>

            <p><strong>Abstract:</strong><br>
            We hypothesize that a user's visual history, with images reflecting their daily life, offers valuable insights into their interests and preferences, and can be leveraged for personalization. Among the many challenges to achieve this goal, the foremost is the diversity and noise in the visual history, which contains images not necessarily related to a recommendation task, not necessarily reflecting the user's interest, or not even preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TÜLU 3: Pushing Frontiers in Open Language Model Post-Training</title>
      <itunes:episode>139</itunes:episode>
      <podcast:episode>139</podcast:episode>
      <itunes:title>TÜLU 3: Pushing Frontiers in Open Language Model Post-Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f2825d2c-0714-49e2-b089-9e163fc73b40</guid>
      <link>https://share.transistor.fm/s/7a24bddc</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            TÜLU 3: Pushing Frontiers in Open Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15124v1">http://arxiv.org/abs/2411.15124v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            TÜLU 3: Pushing Frontiers in Open Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15124v1">http://arxiv.org/abs/2411.15124v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:55:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7a24bddc/91d14413.mp3" length="25383519" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1583</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            TÜLU 3: Pushing Frontiers in Open Language Model Post-Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15124v1">http://arxiv.org/abs/2411.15124v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce TÜLU 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. TÜLU 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With TÜLU 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the TÜLU 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the TÜLU 3 approach to more domains.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Style-Friendly SNR Sampler for Style-Driven Generation</title>
      <itunes:episode>138</itunes:episode>
      <podcast:episode>138</podcast:episode>
      <itunes:title>Style-Friendly SNR Sampler for Style-Driven Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8aad6b52-9618-448c-ade8-c2fa5b7f2caf</guid>
      <link>https://share.transistor.fm/s/f68fa9a5</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Style-Friendly SNR Sampler for Style-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14793v1">http://arxiv.org/abs/2411.14793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new "style templates", enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Style-Friendly SNR Sampler for Style-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14793v1">http://arxiv.org/abs/2411.14793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new "style templates", enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:54:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f68fa9a5/87d9cfc8.mp3" length="19334388" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1205</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Sungroh Yoon</p>

            <p><strong>Title:</strong><br>
            Style-Friendly SNR Sampler for Style-Driven Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14793v1">http://arxiv.org/abs/2411.14793v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new "style templates", enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OminiControl: Minimal and Universal Control for Diffusion Transformer</title>
      <itunes:episode>137</itunes:episode>
      <podcast:episode>137</podcast:episode>
      <itunes:title>OminiControl: Minimal and Universal Control for Diffusion Transformer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">060a3a7d-ab00-408a-abcd-3d8264ca82f6</guid>
      <link>https://share.transistor.fm/s/32c17ad2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            OminiControl: Minimal and Universal Control for Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15098v1">http://arxiv.org/abs/2411.15098v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            OminiControl: Minimal and Universal Control for Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15098v1">http://arxiv.org/abs/2411.15098v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:54:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/32c17ad2/19d0fd8c.mp3" length="25072565" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1563</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 22 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang</p>

            <p><strong>Title:</strong><br>
            OminiControl: Minimal and Universal Control for Diffusion Transformer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.15098v1">http://arxiv.org/abs/2411.15098v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection</title>
      <itunes:episode>136</itunes:episode>
      <podcast:episode>136</podcast:episode>
      <itunes:title>A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">75f34531-a64b-40f0-9ac7-9f66c3609819</guid>
      <link>https://share.transistor.fm/s/dc53c131</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.LG, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Gabriel Chua, Shing Yee Chan, Shaun Khoo</p>

            <p><strong>Title:</strong><br>
            A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12946v1">http://arxiv.org/abs/2411.12946v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.LG, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Gabriel Chua, Shing Yee Chan, Shaun Khoo</p>

            <p><strong>Title:</strong><br>
            A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12946v1">http://arxiv.org/abs/2411.12946v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:53:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/dc53c131/57d3d8b1.mp3" length="22691068" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.LG, 68T50, I.2.7</p>

            <p><strong>Authors:</strong><br>
            Gabriel Chua, Shing Yee Chan, Shaun Khoo</p>

            <p><strong>Title:</strong><br>
            A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12946v1">http://arxiv.org/abs/2411.12946v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games</title>
      <itunes:episode>135</itunes:episode>
      <podcast:episode>135</podcast:episode>
      <itunes:title>BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d812e255-45ef-4bab-a478-fd4c224e7e39</guid>
      <link>https://share.transistor.fm/s/bdc380b1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel</p>

            <p><strong>Title:</strong><br>
            BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13543v1">http://arxiv.org/abs/2411.13543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, from tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel</p>

            <p><strong>Title:</strong><br>
            BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13543v1">http://arxiv.org/abs/2411.13543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, from tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:53:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bdc380b1/4c3ff98e.mp3" length="26193103" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1633</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel</p>

            <p><strong>Title:</strong><br>
            BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13543v1">http://arxiv.org/abs/2411.13543v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, from tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</title>
      <itunes:episode>134</itunes:episode>
      <podcast:episode>134</podcast:episode>
      <itunes:title>Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">54311ee0-6b0e-40ef-a915-e9a69ea9d9c5</guid>
      <link>https://share.transistor.fm/s/1eed2266</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14982v1">http://arxiv.org/abs/2411.14982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the LMMs themselves interpret the open-semantic features learned by the SAE. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14982v1">http://arxiv.org/abs/2411.14982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the LMMs themselves interpret the open-semantic features learned by the SAE. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:53:11 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1eed2266/872b1537.mp3" length="22077475" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1376</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Large Multi-modal Models Can Interpret Features in Large Multi-modal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14982v1">http://arxiv.org/abs/2411.14982v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the LMMs themselves interpret the open-semantic features learned by the SAE. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection</title>
      <itunes:episode>133</itunes:episode>
      <podcast:episode>133</podcast:episode>
      <itunes:title>VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">56cf1b67-cc97-43c0-9078-aeea6a48c5ff</guid>
      <link>https://share.transistor.fm/s/3270aef2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu</p>

            <p><strong>Title:</strong><br>
            VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14794v1">http://arxiv.org/abs/2411.14794v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu</p>

            <p><strong>Title:</strong><br>
            VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14794v1">http://arxiv.org/abs/2411.14794v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:52:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3270aef2/91d178cf.mp3" length="20648091" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1287</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu</p>

            <p><strong>Title:</strong><br>
            VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14794v1">http://arxiv.org/abs/2411.14794v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction</title>
      <itunes:episode>132</itunes:episode>
      <podcast:episode>132</podcast:episode>
      <itunes:title>Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0890facb-f80b-48b4-bbcb-3d458c77c9f1</guid>
      <link>https://share.transistor.fm/s/d3bc20a8</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo</p>

            <p><strong>Title:</strong><br>
            Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14762v1">http://arxiv.org/abs/2411.14762v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x, y, t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo</p>

            <p><strong>Title:</strong><br>
            Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14762v1">http://arxiv.org/abs/2411.14762v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x, y, t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:52:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d3bc20a8/9c9aa75f.mp3" length="25035373" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1561</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo</p>

            <p><strong>Title:</strong><br>
            Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14762v1">http://arxiv.org/abs/2411.14762v1</a></p>

            <p><strong>Abstract:</strong><br>
            Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MyTimeMachine: Personalized Facial Age Transformation</title>
      <itunes:episode>131</itunes:episode>
      <podcast:episode>131</podcast:episode>
      <itunes:title>MyTimeMachine: Personalized Facial Age Transformation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">74a16eb4-9ac0-4578-801f-eabf3dd2b4f4</guid>
      <link>https://share.transistor.fm/s/fc589f72</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luchao Qi, Jiaye Wu, Bang Gong, Annie N. Wang, David W. Jacobs, Roni Sengupta</p>

            <p><strong>Title:</strong><br>
            MyTimeMachine: Personalized Facial Age Transformation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14521v1">http://arxiv.org/abs/2411.14521v1</a></p>

            <p><strong>Abstract:</strong><br>
            Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior that accurately predicts aging for any individual. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person's appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20$\sim$40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.</p>
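
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A rough sketch of a latent-space adapter that adds a person-specific correction on top of a frozen global aging prior in a StyleGAN-style W space. The layer sizes, the age conditioning, and the single regularizer shown are hypothetical stand-ins; the paper's three losses are only indicated in comments.</p>

<pre><code># Hypothetical adapter on aging latents (not the authors' code).
import torch
import torch.nn as nn

class AgingAdapter(nn.Module):
    def __init__(self, w_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(w_dim + 1, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, w_dim))

    def forward(self, w_global_aged, target_age):
        # Predict a personalized residual on top of the global aging latent.
        age = target_age.view(-1, 1)
        delta = self.net(torch.cat([w_global_aged, age], dim=-1))
        return w_global_aged + delta

adapter = AgingAdapter()
w_global_aged = torch.randn(4, 512)        # output of a frozen global aging prior
target_age = torch.full((4,), 60.0)
w_personal = adapter(w_global_aged, target_age)

# Training would combine: (1) a personalized aging loss against the user's photos,
# (2) an extrapolation regularizer outside the observed age range, and
# (3) an adaptive w-norm regularizer, of which this is only a crude stand-in.
w_norm_reg = (w_personal - w_global_aged).norm(dim=-1).mean()
w_norm_reg.backward()
</code></pre>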
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luchao Qi, Jiaye Wu, Bang Gong, Annie N. Wang, David W. Jacobs, Roni Sengupta</p>

            <p><strong>Title:</strong><br>
            MyTimeMachine: Personalized Facial Age Transformation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14521v1">http://arxiv.org/abs/2411.14521v1</a></p>

            <p><strong>Abstract:</strong><br>
            Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior that accurately predicts aging for any individual. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person's appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20$\sim$40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:52:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fc589f72/83b82997.mp3" length="21149584" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1318</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Luchao Qi, Jiaye Wu, Bang Gong, Annie N. Wang, David W. Jacobs, Roni Sengupta</p>

            <p><strong>Title:</strong><br>
            MyTimeMachine: Personalized Facial Age Transformation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14521v1">http://arxiv.org/abs/2411.14521v1</a></p>

            <p><strong>Abstract:</strong><br>
            Facial aging is a complex process, highly dependent on multiple factors like gender, ethnicity, lifestyle, etc., making it extremely challenging to learn a global aging prior that accurately predicts aging for any individual. Existing techniques often produce realistic and plausible aging results, but the re-aged images often do not resemble the person's appearance at the target age and thus need personalization. In many practical applications of virtual aging, e.g. VFX in movies and TV shows, access to a personal photo collection of the user depicting aging in a small time interval (20$\sim$40 years) is often available. However, naive attempts to personalize global aging techniques on personal photo collections often fail. Thus, we propose MyTimeMachine (MyTM), which combines a global aging prior with a personal photo collection (using as few as 50 images) to learn a personalized age transformation. We introduce a novel Adapter Network that combines personalized aging features with global aging features and generates a re-aged image with StyleGAN2. We also introduce three loss functions to personalize the Adapter Network with personalized aging loss, extrapolation regularization, and adaptive w-norm regularization. Our approach can also be extended to videos, achieving high-quality, identity-preserving, and temporally consistent aging effects that resemble actual appearances at target ages, demonstrating its superiority over state-of-the-art approaches.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Novel View Extrapolation with Video Diffusion Priors</title>
      <itunes:episode>130</itunes:episode>
      <podcast:episode>130</podcast:episode>
      <itunes:title>Novel View Extrapolation with Video Diffusion Priors</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2bfaef1d-e84a-4845-a767-68c164a8325a</guid>
      <link>https://share.transistor.fm/s/5330ab49</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kunhao Liu, Ling Shao, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            Novel View Extrapolation with Video Diffusion Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14208v1">http://arxiv.org/abs/2411.14208v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of novel view synthesis has made significant strides thanks to the development of radiance field methods. However, most radiance field techniques are far better at novel view interpolation than novel view extrapolation, where the synthesized novel views are far beyond the observed training views. We design ViewExtrapolator, a novel view synthesis approach that leverages the generative priors of Stable Video Diffusion (SVD) for realistic novel view extrapolation. By redesigning the SVD denoising process, ViewExtrapolator refines the artifact-prone views rendered by radiance fields, greatly enhancing the clarity and realism of the synthesized novel views. ViewExtrapolator is a generic novel view extrapolator that can work with different types of 3D rendering such as views rendered from point clouds when only a single view or monocular video is available. Additionally, ViewExtrapolator requires no fine-tuning of SVD, making it both data-efficient and computation-efficient. Extensive experiments demonstrate the superiority of ViewExtrapolator in novel view extrapolation. Project page: \url{https://kunhao-liu.github.io/ViewExtrapolator/}.</p>
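
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            One generic way to refine artifact-prone rendered frames with a pretrained video diffusion model, without fine-tuning, is an SDEdit-style pass: partially noise the rendered sequence and run only the remaining denoising steps. The sketch below shows just that generic idea with a placeholder denoiser; it is not the paper's redesigned SVD denoising process.</p>

<pre><code># Simplified partial-noise-then-denoise refinement (placeholder denoiser).
import torch

def refine_rendered_video(frames, denoise_step, alphas_cumprod, start_step):
    # frames: (T, C, H, W) rendered by a radiance field, values roughly in [-1, 1].
    a = alphas_cumprod[start_step]
    x = a.sqrt() * frames + (1 - a).sqrt() * torch.randn_like(frames)
    for step in range(start_step, -1, -1):
        x = denoise_step(x, step)     # one reverse-diffusion update of the video model
    return x

# Toy stand-ins so the sketch runs end to end.
num_steps = 50
alphas_cumprod = torch.linspace(0.9999, 0.05, num_steps)
fake_denoiser = lambda x, step: x * 0.98        # placeholder for a real video denoiser
rendered = torch.zeros(14, 3, 64, 64)
refined = refine_rendered_video(rendered, fake_denoiser, alphas_cumprod, start_step=25)
print(refined.shape)
</code></pre>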
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kunhao Liu, Ling Shao, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            Novel View Extrapolation with Video Diffusion Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14208v1">http://arxiv.org/abs/2411.14208v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of novel view synthesis has made significant strides thanks to the development of radiance field methods. However, most radiance field techniques are far better at novel view interpolation than novel view extrapolation, where the synthesized novel views are far beyond the observed training views. We design ViewExtrapolator, a novel view synthesis approach that leverages the generative priors of Stable Video Diffusion (SVD) for realistic novel view extrapolation. By redesigning the SVD denoising process, ViewExtrapolator refines the artifact-prone views rendered by radiance fields, greatly enhancing the clarity and realism of the synthesized novel views. ViewExtrapolator is a generic novel view extrapolator that can work with different types of 3D rendering such as views rendered from point clouds when only a single view or monocular video is available. Additionally, ViewExtrapolator requires no fine-tuning of SVD, making it both data-efficient and computation-efficient. Extensive experiments demonstrate the superiority of ViewExtrapolator in novel view extrapolation. Project page: \url{https://kunhao-liu.github.io/ViewExtrapolator/}.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 25 Nov 2024 19:51:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/5330ab49/fbaf2ab9.mp3" length="20586592" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1283</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Kunhao Liu, Ling Shao, Shijian Lu</p>

            <p><strong>Title:</strong><br>
            Novel View Extrapolation with Video Diffusion Priors</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14208v1">http://arxiv.org/abs/2411.14208v1</a></p>

            <p><strong>Abstract:</strong><br>
            The field of novel view synthesis has made significant strides thanks to the development of radiance field methods. However, most radiance field techniques are far better at novel view interpolation than novel view extrapolation, where the synthesized novel views are far beyond the observed training views. We design ViewExtrapolator, a novel view synthesis approach that leverages the generative priors of Stable Video Diffusion (SVD) for realistic novel view extrapolation. By redesigning the SVD denoising process, ViewExtrapolator refines the artifact-prone views rendered by radiance fields, greatly enhancing the clarity and realism of the synthesized novel views. ViewExtrapolator is a generic novel view extrapolator that can work with different types of 3D rendering such as views rendered from point clouds when only a single view or monocular video is available. Additionally, ViewExtrapolator requires no fine-tuning of SVD, making it both data-efficient and computation-efficient. Extensive experiments demonstrate the superiority of ViewExtrapolator in novel view extrapolation. Project page: \url{https://kunhao-liu.github.io/ViewExtrapolator/}.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization</title>
      <itunes:episode>129</itunes:episode>
      <podcast:episode>129</podcast:episode>
      <itunes:title>Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">428c3ffe-a006-4ba2-a901-95c2d70b916e</guid>
      <link>https://share.transistor.fm/s/d35efb88</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 42 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10442v1">http://arxiv.org/abs/2411.10442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.</p>
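
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A toy sketch of a mixed objective in the spirit of the abstract: a DPO-style preference term on chosen/rejected pairs combined with a supervised term on the chosen response. The weights shown, and the omission of any additional quality term, are assumptions rather than the paper's exact formulation.</p>

<pre><code># Toy mixed preference objective (hypothetical weights, not the paper's recipe).
import torch
import torch.nn.functional as F

def mixed_preference_loss(logp_chosen, logp_rejected,
                          ref_logp_chosen, ref_logp_rejected,
                          beta=0.1, sft_weight=1.0):
    # Sequence-level log-likelihoods under the policy and a frozen reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    preference = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    sft = -logp_chosen.mean()          # keep generating the preferred answer
    return preference + sft_weight * sft

logp_c = torch.tensor([-12.3, -9.8], requires_grad=True)
logp_r = torch.tensor([-11.0, -10.5], requires_grad=True)
loss = mixed_preference_loss(logp_c, logp_r,
                             ref_logp_chosen=torch.tensor([-12.0, -10.0]),
                             ref_logp_rejected=torch.tensor([-11.2, -10.1]))
loss.backward()
</code></pre>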
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 42 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10442v1">http://arxiv.org/abs/2411.10442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:44:42 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d35efb88/cd4399db.mp3" length="18652326" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1162</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 42 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai</p>

            <p><strong>Title:</strong><br>
            Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10442v1">http://arxiv.org/abs/2411.10442v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in Chain-of-Thought (CoT) performance. To address this, we introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs. Specifically, (1) on the data side, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset, and (2) on the model side, we explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance. Our approach demonstrates improved performance across multiple benchmarks, particularly in multimodal reasoning tasks. Notably, our model, InternVL2-8B-MPO, achieves an accuracy of 67.0 on MathVista, outperforming InternVL2-8B by 8.7 points and achieving performance comparable to the 10x larger InternVL2-76B. We hope this study could inspire further advancements in MLLMs. Code, data, and model shall be publicly released.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Multimodal Autoregressive Pre-training of Large Vision Encoders</title>
      <itunes:episode>128</itunes:episode>
      <podcast:episode>128</podcast:episode>
      <itunes:title>Multimodal Autoregressive Pre-training of Large Vision Encoders</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">edf58156-d171-4cdb-b819-6e49572da9bb</guid>
      <link>https://share.transistor.fm/s/c949ef95</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby</p>

            <p><strong>Title:</strong><br>
            Multimodal Autoregressive Pre-training of Large Vision Encoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14402v1">http://arxiv.org/abs/2411.14402v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.</p>
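
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A compact sketch of the training signal described: a vision encoder feeds a causal multimodal decoder that regresses the next raw image patch and then predicts the next caption token. The layer sizes, single shared decoder, and unit loss weighting are toy assumptions, not the AIMV2 implementation.</p>

<pre><code># Toy multimodal autoregressive objective over image patches and text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

patch_dim, d_model, vocab = 16 * 16 * 3, 256, 1000
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
decoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 2)
patch_in, patch_out = nn.Linear(patch_dim, d_model), nn.Linear(d_model, patch_dim)
tok_embed, tok_out = nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab)

patches = torch.randn(2, 64, patch_dim)          # an image as a sequence of raw patches
text = torch.randint(0, vocab, (2, 12))          # caption token ids
n_img = patches.size(1)

vision_feats = encoder(patch_in(patches))        # the vision encoder being trained
seq = torch.cat([vision_feats, tok_embed(text)], dim=1)
length = seq.size(1)
causal = torch.full((length, length), float("-inf")).triu(1)   # additive causal mask
hidden = decoder(seq, mask=causal)

# Position i predicts patch i+1 (regression); the last patch position and each
# text position predict the next caption token (cross-entropy).
img_loss = F.mse_loss(patch_out(hidden[:, : n_img - 1]), patches[:, 1:])
txt_logits = tok_out(hidden[:, n_img - 1 : -1])
txt_loss = F.cross_entropy(txt_logits.reshape(-1, vocab), text.reshape(-1))
(img_loss + txt_loss).backward()
</code></pre>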
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby</p>

            <p><strong>Title:</strong><br>
            Multimodal Autoregressive Pre-training of Large Vision Encoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14402v1">http://arxiv.org/abs/2411.14402v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:44:20 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c949ef95/fe9c23ec.mp3" length="23348477" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1456</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby</p>

            <p><strong>Title:</strong><br>
            Multimodal Autoregressive Pre-training of Large Vision Encoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14402v1">http://arxiv.org/abs/2411.14402v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions</title>
      <itunes:episode>127</itunes:episode>
      <podcast:episode>127</podcast:episode>
      <itunes:title>Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5471bb6c-ec62-448f-974c-d1c9ac180d83</guid>
      <link>https://share.transistor.fm/s/2c675a77</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14405v1">http://arxiv.org/abs/2411.14405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Currently, OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?" Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.</p>
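
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A toy illustration of MCTS-style search over reasoning steps, one of the components the abstract lists. Here propose_steps and score_rollout are hypothetical placeholders for LLM calls, and nothing about the search depth, branching, or scoring reflects Marco-o1's actual design.</p>

<pre><code># Toy UCT search over partial chains of thought (placeholder LLM calls).
import math, random

def propose_steps(chain):            # placeholder: an LLM proposing candidate next steps
    return [chain + [f"step{len(chain)}-{i}"] for i in range(2)]

def score_rollout(chain):            # placeholder: e.g. answer confidence or a reward model
    return random.random()

def uct_search(root, iterations=20, c=1.4, max_depth=4):
    stats = {tuple(root): [0, 0.0]}  # node -> [visits, total value]
    for _ in range(iterations):
        node = root
        # Selection: descend while all children are known and depth allows it.
        while len(node) != max_depth and all(tuple(ch) in stats for ch in propose_steps(node)):
            parent_visits = stats[tuple(node)][0] + 1
            def uct(ch):
                v, s = stats[tuple(ch)]
                return s / (v + 1e-9) + c * math.sqrt(math.log(parent_visits) / (v + 1e-9))
            node = max(propose_steps(node), key=uct)
        for ch in propose_steps(node):          # expansion
            stats.setdefault(tuple(ch), [0, 0.0])
        reward = score_rollout(node)            # evaluation
        for depth in range(len(node) + 1):      # backpropagation along the chosen chain
            entry = stats[tuple(node[:depth])]
            entry[0] += 1
            entry[1] += reward
    return max(propose_steps(root), key=lambda ch: stats[tuple(ch)][0])

print(uct_search(root=[]))
</code></pre>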
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14405v1">http://arxiv.org/abs/2411.14405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Currently, OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?" Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:43:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2c675a77/9807bb7d.mp3" length="18354284" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1143</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang</p>

            <p><strong>Title:</strong><br>
            Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14405v1">http://arxiv.org/abs/2411.14405v1</a></p>

            <p><strong>Abstract:</strong><br>
            Currently, OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?" Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hymba: A Hybrid-head Architecture for Small Language Models</title>
      <itunes:episode>126</itunes:episode>
      <podcast:episode>126</podcast:episode>
      <itunes:title>Hymba: A Hybrid-head Architecture for Small Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a06de184-5be9-4d13-ac3b-f2c6e9031c9c</guid>
      <link>https://share.transistor.fm/s/e71b8361</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            Hymba: A Hybrid-head Architecture for Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13676v1">http://arxiv.org/abs/2411.13676v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.</p>
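
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A schematic sketch of a "hybrid-head" block: the same token sequence is processed in parallel by attention heads and by a cheap SSM-like running summary, with learnable meta tokens prepended. The stand-in SSM, the fusion layer, and all sizes are assumptions for illustration; causal masking and KV sharing are omitted.</p>

<pre><code># Schematic parallel attention + SSM-style heads with prepended meta tokens.
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_meta=8):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, n_meta, d_model))   # learnable meta tokens
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Minimal stand-in for an SSM head: a gated running average of the sequence.
        self.ssm_in = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, x):
        b = x.size(0)
        x = torch.cat([self.meta.expand(b, -1, -1), x], dim=1)      # prepend meta tokens
        attn_out, _ = self.attn(x, x, x, need_weights=False)        # high-resolution recall
        counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        running = torch.cumsum(self.ssm_in(x), dim=1) / counts      # cheap context summary
        ssm_out = torch.sigmoid(self.gate(x)) * running
        return self.fuse(torch.cat([attn_out, ssm_out], dim=-1))    # fuse both head types

block = HybridHeadBlock()
out = block(torch.randn(2, 32, 128))
print(out.shape)       # (2, 8 + 32, 128)
</code></pre>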
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            Hymba: A Hybrid-head Architecture for Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13676v1">http://arxiv.org/abs/2411.13676v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:43:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e71b8361/035ab792.mp3" length="23483056" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1464</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov</p>

            <p><strong>Title:</strong><br>
            Hymba: A Hybrid-head Architecture for Small Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13676v1">http://arxiv.org/abs/2411.13676v1</a></p>

            <p><strong>Abstract:</strong><br>
            We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Natural Language Reinforcement Learning</title>
      <itunes:episode>125</itunes:episode>
      <podcast:episode>125</podcast:episode>
      <itunes:title>Natural Language Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1724f50b-9833-453f-a980-ae263d35bb6c</guid>
      <link>https://share.transistor.fm/s/9d56b090</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Natural Language Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14251v1">http://arxiv.org/abs/2411.14251v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending traditional MDP to natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.</p>
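
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A heavily simplified sketch of "language" policy evaluation and improvement by pure prompting, in the spirit of the abstract. The llm function below is a placeholder for a real model call, and every prompt is invented for illustration; the paper's actual language value function and policy iteration are more involved.</p>

<pre><code># Placeholder prompting loop for language-based evaluation and improvement.
def llm(prompt):
    # Stand-in for a real LLM call (API or local model).
    return "This position looks favorable because the center is controlled."

def language_value(state_text, successor_texts):
    # "Language" evaluation: describe successors in words, then aggregate the descriptions.
    descriptions = [llm("Evaluate this state in words:\n" + s) for s in successor_texts]
    return llm("Summarize these evaluations into one assessment:\n" + "\n".join(descriptions))

def language_policy_improvement(state_text, candidate_actions):
    # Improvement step: pick the action whose described outcome the model judges best.
    analyses = {a: language_value(state_text, [state_text + " after " + a])
                for a in candidate_actions}
    prompt = ("Given these analyses, name the single best action:\n"
              + "\n".join(a + ": " + v for a, v in analyses.items()))
    return llm(prompt)

print(language_policy_improvement("Tic-Tac-Toe board: X at center",
                                  ["play corner", "play edge"]))
</code></pre>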
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Natural Language Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14251v1">http://arxiv.org/abs/2411.14251v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending traditional MDP to natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:43:16 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9d56b090/865684bc.mp3" length="23053374" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1437</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Natural Language Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14251v1">http://arxiv.org/abs/2411.14251v1</a></p>

            <p><strong>Abstract:</strong><br>
            Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending traditional MDP to natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs</title>
      <itunes:episode>124</itunes:episode>
      <podcast:episode>124</podcast:episode>
      <itunes:title>OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8998e718-21d5-4fbf-9c48-c62a3005cd13</guid>
      <link>https://share.transistor.fm/s/ec831db9</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.AI, cs.DL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14199v1">http://arxiv.org/abs/2411.14199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improves off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT4o's 32%. We open-source all of our code, models, datastore, data and a public demo.</p>
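
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            A generic retrieve-then-synthesize loop with a self-feedback revision pass, matching the pipeline the abstract outlines only at a high level. The embed and generate functions are placeholders, and nothing here corresponds to the released OpenScholar code, retriever, or datastore.</p>

<pre><code># Generic retrieval-augmented answering with citations and self-feedback (placeholders).
import numpy as np

def embed(text):                      # placeholder dense retriever encoder
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(64)

def generate(prompt):                 # placeholder LLM call
    return "Draft answer citing [1] and [2]."

def retrieve(query, passages, k=3):
    q = embed(query)
    ranked = sorted(passages, key=lambda p: float(np.dot(embed(p), q)), reverse=True)
    return ranked[:k]

def answer_with_citations(query, corpus, feedback_rounds=2):
    passages = retrieve(query, corpus)
    context = "\n".join("[" + str(i + 1) + "] " + p for i, p in enumerate(passages))
    answer = generate("Question: " + query + "\nPassages:\n" + context + "\nAnswer with citations:")
    for _ in range(feedback_rounds):  # self-feedback loop: critique, then revise
        feedback = generate("List missing evidence or unsupported claims in:\n" + answer)
        answer = generate("Revise the answer using this feedback:\n" + feedback + "\n\n" + answer)
    return answer

corpus = ["Paper A reports X.", "Paper B disputes X.", "Paper C surveys X methods."]
print(answer_with_citations("What is known about X?", corpus))
</code></pre>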
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.AI, cs.DL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14199v1">http://arxiv.org/abs/2411.14199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improves off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT4o's 32%. We open-source all of our code, models, datastore, data and a public demo.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:42:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ec831db9/5eec8d80.mp3" length="22505884" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1403</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL, cs.AI, cs.DL, cs.IR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi</p>

            <p><strong>Title:</strong><br>
            OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14199v1">http://arxiv.org/abs/2411.14199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improves off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT4o's 32%. We open-source all of our code, models, datastore, data and a public demo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Ultra-Sparse Memory Network</title>
      <itunes:episode>123</itunes:episode>
      <podcast:episode>123</podcast:episode>
      <itunes:title>Ultra-Sparse Memory Network</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f88ddbe0-e194-4c5f-9f9d-8e34cc788679</guid>
      <link>https://share.transistor.fm/s/b61233fe</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Ultra-Sparse Memory Network</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12364v1">http://arxiv.org/abs/2411.12364v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, incorporating a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.</p>
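
            <p><strong>Code sketch (illustrative, not from the paper):</strong><br>
            An illustrative sparse memory layer in the spirit of "a few active slots out of millions": each token scores the slot keys and mixes only the top-k values. The dense scoring shown here is a simplification (real designs avoid the full query-key product), and the slot count and dimensions are assumptions.</p>

<pre><code># Toy ultra-sparse memory lookup: only top-k of many value slots are touched per token.
import torch
import torch.nn as nn

class SparseMemoryLayer(nn.Module):
    def __init__(self, d_model=256, n_slots=100_000, topk=16):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.EmbeddingBag(n_slots, d_model, mode="sum")   # large value table
        self.topk = topk

    def forward(self, x):
        b, t, d = x.shape
        q = self.query(x).reshape(b * t, d)
        scores = q @ self.keys.t()                      # dense scoring, shown for simplicity
        weights, idx = scores.topk(self.topk, dim=-1)   # keep only the top-k slots per token
        weights = torch.softmax(weights, dim=-1)
        out = self.values(idx, per_sample_weights=weights)   # sparse weighted gather
        return out.reshape(b, t, d)

layer = SparseMemoryLayer(n_slots=10_000)               # smaller table for a quick test
y = layer(torch.randn(2, 8, 256))
print(y.shape)
</code></pre>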
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Ultra-Sparse Memory Network</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12364v1">http://arxiv.org/abs/2411.12364v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, incorporating a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:42:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b61233fe/9f5ce6f8.mp3" length="19606870" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1222</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, Xun Zhou</p>

            <p><strong>Title:</strong><br>
            Ultra-Sparse Memory Network</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12364v1">http://arxiv.org/abs/2411.12364v1</a></p>

            <p><strong>Abstract:</strong><br>
            It is widely acknowledged that the performance of Transformer models is exponentially related to their number of parameters and computational complexity. While approaches like Mixture of Experts (MoE) decouple parameter count from computational complexity, they still face challenges in inference due to high memory access costs. This work introduces UltraMem, incorporating a large-scale, ultra-sparse memory layer to address these limitations. Our approach significantly reduces inference latency while maintaining model performance. We also investigate the scaling laws of this new architecture, demonstrating that it not only exhibits favorable scaling properties but also outperforms traditional models. In our experiments, we train networks with up to 20 million memory slots. The results show that our method achieves state-of-the-art inference speed and model performance within a given computational budget.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</title>
      <itunes:episode>122</itunes:episode>
      <podcast:episode>122</podcast:episode>
      <itunes:title>Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e2703f3b-d1fb-4fe3-a71e-621493786249</guid>
      <link>https://share.transistor.fm/s/26a5abd5</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14432v1">http://arxiv.org/abs/2411.14432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14432v1">http://arxiv.org/abs/2411.14432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:42:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/26a5abd5/b82e85a8.mp3" length="23468036" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1463</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14432v1">http://arxiv.org/abs/2411.14432v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain inadequately explored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data will not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stable Flow: Vital Layers for Training-Free Image Editing</title>
      <itunes:episode>121</itunes:episode>
      <podcast:episode>121</podcast:episode>
      <itunes:title>Stable Flow: Vital Layers for Training-Free Image Editing</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">99801d27-6d49-4d3b-9cb5-a6d34e4604cb</guid>
      <link>https://share.transistor.fm/s/a931dc58</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or</p>

            <p><strong>Title:</strong><br>
            Stable Flow: Vital Layers for Training-Free Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14430v1">http://arxiv.org/abs/2411.14430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or</p>

            <p><strong>Title:</strong><br>
            Stable Flow: Vital Layers for Training-Free Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14430v1">http://arxiv.org/abs/2411.14430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:41:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a931dc58/ad9d0682.mp3" length="21972131" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1370</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, Daniel Cohen-Or</p>

            <p><strong>Title:</strong><br>
            Stable Flow: Vital Layers for Training-Free Image Editing</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14430v1">http://arxiv.org/abs/2411.14430v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models</title>
      <itunes:episode>120</itunes:episode>
      <podcast:episode>120</podcast:episode>
      <itunes:title>Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2ffd80de-e50f-4415-a166-a5906e04436a</guid>
      <link>https://share.transistor.fm/s/14322310</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda</p>

            <p><strong>Title:</strong><br>
            Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14257v1">http://arxiv.org/abs/2411.14257v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g., detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda</p>

            <p><strong>Title:</strong><br>
            Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14257v1">http://arxiv.org/abs/2411.14257v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g., detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 22 Nov 2024 19:41:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/14322310/97959414.mp3" length="20760909" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1294</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda</p>

            <p><strong>Title:</strong><br>
            Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.14257v1">http://arxiv.org/abs/2411.14257v1</a></p>

            <p><strong>Abstract:</strong><br>
            Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g., detecting that it doesn't know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration</title>
      <itunes:episode>119</itunes:episode>
      <podcast:episode>119</podcast:episode>
      <itunes:title>SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc91722a-94fb-429e-8c1c-4f9d082662f8</guid>
      <link>https://share.transistor.fm/s/38ce3d6b</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 35 | cs.LG, cs.AI, cs.CV, cs.NE, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10958v1">http://arxiv.org/abs/2411.10958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. SageAttention utilizes 8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit accumulator, and precision-enhancing methods, implementing an accurate kernel with a 2x speedup over FlashAttention2. To further enhance the efficiency of attention computation while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 at a warp-level granularity and quantize matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$ and $V$, enhancing the accuracy of attention with INT4 $QK$ and FP8 $PV$. Third, we analyze the quantization accuracy across timesteps and layers, then propose an adaptive quantization method to preserve end-to-end metrics across various models. The operations per second (OPS) of SageAttention2 surpass those of FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The code is available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 35 | cs.LG, cs.AI, cs.CV, cs.NE, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10958v1">http://arxiv.org/abs/2411.10958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. SageAttention utilizes 8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit accumulator, and precision-enhancing methods, implementing an accurate kernel with a 2x speedup over FlashAttention2. To further enhance the efficiency of attention computation while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 at a warp-level granularity and quantize matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$ and $V$, enhancing the accuracy of attention with INT4 $QK$ and FP8 $PV$. Third, we analyze the quantization accuracy across timesteps and layers, then propose an adaptive quantization method to preserve end-to-end metrics across various models. The operations per second (OPS) of SageAttention2 surpass those of FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The code is available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:46:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38ce3d6b/588e3580.mp3" length="22375921" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1395</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 35 | cs.LG, cs.AI, cs.CV, cs.NE, cs.PF</p>

            <p><strong>Authors:</strong><br>
            Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen</p>

            <p><strong>Title:</strong><br>
            SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10958v1">http://arxiv.org/abs/2411.10958v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. SageAttention utilizes 8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit accumulator, and precision-enhancing methods, implementing an accurate kernel with a 2x speedup over FlashAttention2. To further enhance the efficiency of attention computation while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 at a warp-level granularity and quantize matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$ and $V$, enhancing the accuracy of attention with INT4 $QK$ and FP8 $PV$. Third, we analyze the quantization accuracy across timesteps and layers, then propose an adaptive quantization method to preserve end-to-end metrics across various models. The operations per second (OPS) of SageAttention2 surpass those of FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The code is available at https://github.com/thu-ml/SageAttention.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models</title>
      <itunes:episode>118</itunes:episode>
      <podcast:episode>118</podcast:episode>
      <itunes:title>VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fe7a902e-97e8-4e29-9c47-6a90145e98b6</guid>
      <link>https://share.transistor.fm/s/8619c037</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13503v1">http://arxiv.org/abs/2411.13503v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). These fine-grained evaluation metrics reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmark's alignment with human perception for each evaluation dimension. 3) Valuable Insights: We examine current models' abilities across various evaluation dimensions and content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating both text-to-video and image-to-video generation. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13503v1">http://arxiv.org/abs/2411.13503v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). These fine-grained evaluation metrics reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmark's alignment with human perception for each evaluation dimension. 3) Valuable Insights: We examine current models' abilities across various evaluation dimensions and content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating both text-to-video and image-to-video generation. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:46:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8619c037/19aa5645.mp3" length="23993406" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1496</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 23 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu</p>

            <p><strong>Title:</strong><br>
            VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13503v1">http://arxiv.org/abs/2411.13503v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). These fine-grained evaluation metrics reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmark's alignment with human perception for each evaluation dimension. 3) Valuable Insights: We examine current models' abilities across various evaluation dimensions and content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating both text-to-video and image-to-video generation. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation</title>
      <itunes:episode>117</itunes:episode>
      <podcast:episode>117</podcast:episode>
      <itunes:title>VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">22c1c0df-fbfc-4c27-b1aa-a49eff47eb1c</guid>
      <link>https://share.transistor.fm/s/78e2c94f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li</p>

            <p><strong>Title:</strong><br>
            VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13281v1">http://arxiv.org/abs/2411.13281v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li</p>

            <p><strong>Title:</strong><br>
            VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13281v1">http://arxiv.org/abs/2411.13281v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:46:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78e2c94f/0dbe10a0.mp3" length="23275386" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1451</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li</p>

            <p><strong>Title:</strong><br>
            VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13281v1">http://arxiv.org/abs/2411.13281v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory</title>
      <itunes:episode>116</itunes:episode>
      <podcast:episode>116</podcast:episode>
      <itunes:title>SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">63731511-b771-454d-8836-ff86083d860b</guid>
      <link>https://share.transistor.fm/s/bf62b866</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang</p>

            <p><strong>Title:</strong><br>
            SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11922v1">http://arxiv.org/abs/2411.11922v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at https://github.com/yangchris11/samurai.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang</p>

            <p><strong>Title:</strong><br>
            SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11922v1">http://arxiv.org/abs/2411.11922v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at https://github.com/yangchris11/samurai.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:45:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bf62b866/ef21dccf.mp3" length="21190586" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang</p>

            <p><strong>Title:</strong><br>
            SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11922v1">http://arxiv.org/abs/2411.11922v1</a></p>

            <p><strong>Abstract:</strong><br>
            The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments. Code and results are available at https://github.com/yangchris11/samurai.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents</title>
      <itunes:episode>115</itunes:episode>
      <podcast:episode>115</podcast:episode>
      <itunes:title>Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">231bfe3f-ba31-475a-9eaa-ece33c72865d</guid>
      <link>https://share.transistor.fm/s/2559682a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su</p>

            <p><strong>Title:</strong><br>
            Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06559v1">http://arxiv.org/abs/2411.06559v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still largely underperform compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su</p>

            <p><strong>Title:</strong><br>
            Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06559v1">http://arxiv.org/abs/2411.06559v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.</p>
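
            <p><strong>Editor's sketch (illustrative, not from the paper):</strong><br>
            A minimal Python sketch of the simulate-then-evaluate loop described above: for each candidate action, an LLM is asked to imagine the resulting page and the imagined outcome is scored against the task. The llm() helper, the prompts, and the 0-10 scoring scale are placeholders, not WebDreamer's released implementation.</p>

            <pre><code>
# Hypothetical sketch of LLM-as-world-model planning (simulate, then evaluate).
# llm() is a placeholder for any text-completion client; plug in your own.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to an actual LLM API")

def plan_next_action(task: str, page_state: str, candidate_actions: list) -> str:
    """Ask the LLM to imagine the outcome of each candidate action in natural
    language, score the imagined outcome against the task, and return the
    highest-scoring action."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # World-model step: describe what the page would look like afterwards.
        outcome = llm(
            f"Task: {task}\nCurrent page: {page_state}\n"
            f"If the agent performs '{action}', describe the resulting page."
        )
        # Evaluation step: ask how much progress the imagined outcome makes.
        reply = llm(
            f"Task: {task}\nImagined page after '{action}': {outcome}\n"
            "Rate the progress toward the task from 0 to 10. Answer with a number."
        )
        try:
            score = float(reply.strip().split()[0])
        except ValueError:
            score = 0.0
        if score > best_score:
            best_score, best_action = score, action
    return best_action
</code></pre>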
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:45:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2559682a/f44cd662.mp3" length="20693624" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1290</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, Yu Su</p>

            <p><strong>Title:</strong><br>
            Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06559v1">http://arxiv.org/abs/2411.06559v1</a></p>

            <p><strong>Abstract:</strong><br>
            Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still underperform largely compared to humans. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints due to irreversible actions such as confirming a purchase. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the innovative use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate outcomes for each candidate action (e.g., "what would happen if I click this button?") using natural language descriptions, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction -- VisualWebArena and Mind2Web-live -- demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training</title>
      <itunes:episode>114</itunes:episode>
      <podcast:episode>114</podcast:episode>
      <itunes:title>When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c983eaa4-0e11-483e-8434-902c5472bc7d</guid>
      <link>https://share.transistor.fm/s/38a113b3</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13476v1">http://arxiv.org/abs/2411.13476v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13476v1">http://arxiv.org/abs/2411.13476v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.</p>
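
            <p><strong>Editor's sketch (illustrative, not from the paper):</strong><br>
            A small NumPy sketch of an anchor-style attention mask for packed documents, loosely following the idea of keeping the first token visible to every document while restricting other attention to causal, within-document positions. The function name and layout are assumptions for illustration only.</p>

            <pre><code>
# Hypothetical sketch of an "anchor" attention mask for packed documents.
import numpy as np

def anchor_attention_mask(doc_ids: np.ndarray) -> np.ndarray:
    """doc_ids[i] is the document index of token i in a packed sequence; token 0
    is the shared anchor. Returns a boolean [seq, seq] mask where True means
    "query may attend to key": causal, within-document, plus the anchor token."""
    seq = len(doc_ids)
    q = np.arange(seq)[:, None]                       # query positions
    k = np.arange(seq)[None, :]                       # key positions
    causal = q >= k                                   # no attending to the future
    same_doc = doc_ids[:, None] == doc_ids[None, :]   # stay inside one document
    anchor = k == 0                                   # first token visible to all
    return np.logical_and(causal, np.logical_or(same_doc, anchor))

# Example: one anchor token followed by two packed documents (lengths 3 and 2).
mask = anchor_attention_mask(np.array([0, 1, 1, 1, 2, 2]))
</code></pre>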
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:45:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/38a113b3/d6b93bf4.mp3" length="23770216" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1482</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang</p>

            <p><strong>Title:</strong><br>
            When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13476v1">http://arxiv.org/abs/2411.13476v1</a></p>

            <p><strong>Abstract:</strong><br>
            Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. Our code is available at https://github.com/haonan3/AnchorContext.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stylecodes: Encoding Stylistic Information For Image Generation</title>
      <itunes:episode>113</itunes:episode>
      <podcast:episode>113</podcast:episode>
      <itunes:title>Stylecodes: Encoding Stylistic Information For Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">60ba385b-b11f-49bf-a562-4fc14efcbcb1</guid>
      <link>https://share.transistor.fm/s/2344e583</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ciara Rowles</p>

            <p><strong>Title:</strong><br>
            Stylecodes: Encoding Stylistic Information For Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12811v1">http://arxiv.org/abs/2411.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. These have seen widespread adoption throughout social media due to both their ease of sharing and the fact they allow using an image for style control, without having to post the source images themselves. However, users are not able to generate srefs from their own images, nor is the underlying training procedure public. We propose StyleCodes: an open-source and open-research style encoder architecture and training procedure to express image style as a 20-symbol base64 code. Our experiments show that our encoding results in minimal loss in quality compared to traditional image-to-style techniques.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ciara Rowles</p>

            <p><strong>Title:</strong><br>
            Stylecodes: Encoding Stylistic Information For Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12811v1">http://arxiv.org/abs/2411.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. These have seen widespread adoption throughout social media due to both their ease of sharing and the fact they allow using an image for style control, without having to post the source images themselves. However, users are not able to generate srefs from their own images, nor is the underlying training procedure public. We propose StyleCodes: an open-source and open-research style encoder architecture and training procedure to express image style as a 20-symbol base64 code. Our experiments show that our encoding results in minimal loss in quality compared to traditional image-to-style techniques.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:44:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2344e583/29fbf597.mp3" length="20640101" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1286</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Ciara Rowles</p>

            <p><strong>Title:</strong><br>
            Stylecodes: Encoding Stylistic Information For Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12811v1">http://arxiv.org/abs/2411.12811v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models excel in image generation, but controlling them remains a challenge. We focus on the problem of style-conditioned image generation. Although example images work, they are cumbersome: srefs (style-reference codes) from MidJourney solve this issue by expressing a specific image style in a short numeric code. These have seen widespread adoption throughout social media due to both their ease of sharing and the fact they allow using an image for style control, without having to post the source images themselves. However, users are not able to generate srefs from their own images, nor is the underlying training procedure public. We propose StyleCodes: an open-source and open-research style encoder architecture and training procedure to express image style as a 20-symbol base64 code. Our experiments show that our encoding results in minimal loss in quality compared to traditional image-to-style techniques.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</title>
      <itunes:episode>112</itunes:episode>
      <podcast:episode>112</podcast:episode>
      <itunes:title>ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">50070e8d-607c-4e42-9bbe-f31c022047cf</guid>
      <link>https://share.transistor.fm/s/83dcf8aa</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das</p>

            <p><strong>Title:</strong><br>
            ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10867v1">http://arxiv.org/abs/2411.10867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das</p>

            <p><strong>Title:</strong><br>
            ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10867v1">http://arxiv.org/abs/2411.10867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:44:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83dcf8aa/55f5d936.mp3" length="22706098" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1415</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das</p>

            <p><strong>Title:</strong><br>
            ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10867v1">http://arxiv.org/abs/2411.10867v1</a></p>

            <p><strong>Abstract:</strong><br>
            Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Loss-to-Loss Prediction: Scaling Laws for All Datasets</title>
      <itunes:episode>111</itunes:episode>
      <podcast:episode>111</podcast:episode>
      <itunes:title>Loss-to-Loss Prediction: Scaling Laws for All Datasets</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7e555fd2-27ab-4084-8de8-3909a7e504f9</guid>
      <link>https://share.transistor.fm/s/cc2e7a59</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade</p>

            <p><strong>Title:</strong><br>
            Loss-to-Loss Prediction: Scaling Laws for All Datasets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12925v1">http://arxiv.org/abs/2411.12925v1</a></p>

            <p><strong>Abstract:</strong><br>
            While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade</p>

            <p><strong>Title:</strong><br>
            Loss-to-Loss Prediction: Scaling Laws for All Datasets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12925v1">http://arxiv.org/abs/2411.12925v1</a></p>

            <p><strong>Abstract:</strong><br>
            While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.</p>
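
            <p><strong>Editor's sketch (illustrative, not from the paper):</strong><br>
            A hedged curve-fitting sketch of one plausible shifted-power-law parameterization relating losses on two datasets at matched compute. The exact functional form used in the paper may differ, and every number below is hypothetical, for illustration only.</p>

            <pre><code>
# Hypothetical sketch: fit one plausible shifted power law relating the loss on
# dataset A to the loss on dataset B for models paired by training compute.
# The parameterization and all numbers below are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, k, kappa, e_a, e_b):
    # loss_b ~ k * (loss_a - e_a)**kappa + e_b, with e_a/e_b as loss offsets
    return k * np.maximum(loss_a - e_a, 1e-9) ** kappa + e_b

loss_a = np.array([3.20, 2.90, 2.70, 2.55, 2.45])    # hypothetical paired losses
loss_b = np.array([3.60, 3.20, 2.95, 2.78, 2.66])

params, _ = curve_fit(shifted_power_law, loss_a, loss_b,
                      p0=[1.0, 1.0, 1.5, 1.5], maxfev=20000)
predicted = shifted_power_law(2.30, *params)         # extrapolate to lower loss_a
</code></pre>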
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:44:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cc2e7a59/22ce0c17.mp3" length="20785960" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1295</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, Sham Kakade</p>

            <p><strong>Title:</strong><br>
            Loss-to-Loss Prediction: Scaling Laws for All Datasets</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12925v1">http://arxiv.org/abs/2411.12925v1</a></p>

            <p><strong>Abstract:</strong><br>
            While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ORID: Organ-Regional Information Driven Framework for Radiology Report Generation</title>
      <itunes:episode>110</itunes:episode>
      <podcast:episode>110</podcast:episode>
      <itunes:title>ORID: Organ-Regional Information Driven Framework for Radiology Report Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f608f9da-24bc-48de-9d1c-9a90fef87838</guid>
      <link>https://share.transistor.fm/s/d74cc67a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai</p>

            <p><strong>Title:</strong><br>
            ORID: Organ-Regional Information Driven Framework for Radiology Report Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13025v1">http://arxiv.org/abs/2411.13025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The objective of Radiology Report Generation (RRG) is to automatically generate coherent textual analyses of diseases based on radiological images, thereby alleviating the workload of radiologists. Current AI-based methods for RRG primarily focus on modifications to the encoder-decoder model architecture. To advance these approaches, this paper introduces an Organ-Regional Information Driven (ORID) framework which can effectively integrate multi-modal information and reduce the influence of noise from unrelated organs. Specifically, based on the LLaVA-Med, we first construct an RRG-related instruction dataset to improve organ-regional diagnosis description ability and get the LLaVA-Med-RRG. After that, we propose an organ-based cross-modal fusion module to effectively combine the information from the organ-regional diagnosis description and radiology image. To further reduce the influence of noise from unrelated organs on the radiology report generation, we introduce an organ importance coefficient analysis module, which leverages Graph Neural Network (GNN) to examine the interconnections of the cross-modal information of each organ region. Extensive experiments and comparisons with state-of-the-art methods across various evaluation metrics demonstrate the superior performance of our proposed method.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai</p>

            <p><strong>Title:</strong><br>
            ORID: Organ-Regional Information Driven Framework for Radiology Report Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13025v1">http://arxiv.org/abs/2411.13025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The objective of Radiology Report Generation (RRG) is to automatically generate coherent textual analyses of diseases based on radiological images, thereby alleviating the workload of radiologists. Current AI-based methods for RRG primarily focus on modifications to the encoder-decoder model architecture. To advance these approaches, this paper introduces an Organ-Regional Information Driven (ORID) framework which can effectively integrate multi-modal information and reduce the influence of noise from unrelated organs. Specifically, based on the LLaVA-Med, we first construct an RRG-related instruction dataset to improve organ-regional diagnosis description ability and get the LLaVA-Med-RRG. After that, we propose an organ-based cross-modal fusion module to effectively combine the information from the organ-regional diagnosis description and radiology image. To further reduce the influence of noise from unrelated organs on the radiology report generation, we introduce an organ importance coefficient analysis module, which leverages Graph Neural Network (GNN) to examine the interconnections of the cross-modal information of each organ region. Extensive experiments and comparisons with state-of-the-art methods across various evaluation metrics demonstrate the superior performance of our proposed method.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 21 Nov 2024 19:43:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d74cc67a/ec0ea13c.mp3" length="19560531" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1219</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai</p>

            <p><strong>Title:</strong><br>
            ORID: Organ-Regional Information Driven Framework for Radiology Report Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.13025v1">http://arxiv.org/abs/2411.13025v1</a></p>

            <p><strong>Abstract:</strong><br>
            The objective of Radiology Report Generation (RRG) is to automatically generate coherent textual analyses of diseases based on radiological images, thereby alleviating the workload of radiologists. Current AI-based methods for RRG primarily focus on modifications to the encoder-decoder model architecture. To advance these approaches, this paper introduces an Organ-Regional Information Driven (ORID) framework which can effectively integrate multi-modal information and reduce the influence of noise from unrelated organs. Specifically, based on the LLaVA-Med, we first construct an RRG-related instruction dataset to improve organ-regional diagnosis description ability and get the LLaVA-Med-RRG. After that, we propose an organ-based cross-modal fusion module to effectively combine the information from the organ-regional diagnosis description and radiology image. To further reduce the influence of noise from unrelated organs on the radiology report generation, we introduce an organ importance coefficient analysis module, which leverages Graph Neural Network (GNN) to examine the interconnections of the cross-modal information of each organ region. Extensive experiments and comparisons with state-of-the-art methods across various evaluation metrics demonstrate the superior performance of our proposed method.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization</title>
      <itunes:episode>109</itunes:episode>
      <podcast:episode>109</podcast:episode>
      <itunes:title>SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2251d743-664d-4367-a2f5-9262f06fc5d8</guid>
      <link>https://share.transistor.fm/s/9bed01a2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11909v1">http://arxiv.org/abs/2411.11909v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11909v1">http://arxiv.org/abs/2411.11909v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.</p>
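
            <p><strong>Editor's sketch (illustrative, not from the paper):</strong><br>
            A loose Python sketch of the data-construction idea summarized above: replacing the textual answers in multimodal in-context demonstrations with arbitrary symbols, so the answer can only be recovered by attending to the demonstration images. The field names and symbol scheme are guesses, not SymDPO's actual pipeline.</p>

            <pre><code>
# Hypothetical sketch of symbol-substituted in-context demonstrations: textual
# answers are replaced by arbitrary symbols, so the answer can only be inferred
# from the demonstration images. Field names and symbol format are guesses.
import random
import string

def random_symbol(length=4):
    return "".join(random.choices(string.ascii_uppercase, k=length))

def symbolize_demonstrations(demos):
    """demos: list of dicts with 'image', 'question' and 'answer' keys.
    Returns the demonstrations with answers replaced by symbols, plus the
    answer-to-symbol mapping used for them."""
    mapping = {}
    symbolized = []
    for demo in demos:
        if demo["answer"] not in mapping:
            mapping[demo["answer"]] = random_symbol()
        symbolized.append({
            "image": demo["image"],
            "question": demo["question"],
            "answer": mapping[demo["answer"]],
        })
    return symbolized, mapping
</code></pre>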
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:39:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bed01a2/4a3345d3.mp3" length="24414330" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1522</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang</p>

            <p><strong>Title:</strong><br>
            SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11909v1">http://arxiv.org/abs/2411.11909v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Continuous Speculative Decoding for Autoregressive Image Generation</title>
      <itunes:episode>108</itunes:episode>
      <podcast:episode>108</podcast:episode>
      <itunes:title>Continuous Speculative Decoding for Autoregressive Image Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d72481ab-4079-434b-8b1c-02ab115808a9</guid>
      <link>https://share.transistor.fm/s/afaf6a3c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang</p>

            <p><strong>Title:</strong><br>
            Continuous Speculative Decoding for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11925v1">http://arxiv.org/abs/2411.11925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregressive framework result in significant inference overhead. While speculative decoding has proven effective in accelerating Large Language Models (LLMs), their adaptation to continuous-valued visual autoregressive models remains unexplored. This work generalizes the speculative decoding algorithm from discrete tokens to continuous space. By analyzing the intrinsic properties of output distribution, we establish a tailored acceptance criterion for the diffusion distributions prevalent in such models. To overcome the inconsistency that occurred in speculative decoding output distributions, we introduce denoising trajectory alignment and token pre-filling methods. Additionally, we identify the hard-to-sample distribution in the rejection phase. To mitigate this issue, we propose a meticulous acceptance-rejection sampling method with a proper upper bound, thereby circumventing complex integration. Experimental results show that our continuous speculative decoding achieves a remarkable 2.33× speed-up on off-the-shelf models while maintaining the output distribution. Codes will be available at https://github.com/MarkXCloud/CSpD</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang</p>

            <p><strong>Title:</strong><br>
            Continuous Speculative Decoding for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11925v1">http://arxiv.org/abs/2411.11925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregressive framework result in significant inference overhead. While speculative decoding has proven effective in accelerating Large Language Models (LLMs), their adaptation to continuous-valued visual autoregressive models remains unexplored. This work generalizes the speculative decoding algorithm from discrete tokens to continuous space. By analyzing the intrinsic properties of output distribution, we establish a tailored acceptance criterion for the diffusion distributions prevalent in such models. To overcome the inconsistency that occurred in speculative decoding output distributions, we introduce denoising trajectory alignment and token pre-filling methods. Additionally, we identify the hard-to-sample distribution in the rejection phase. To mitigate this issue, we propose a meticulous acceptance-rejection sampling method with a proper upper bound, thereby circumventing complex integration. Experimental results show that our continuous speculative decoding achieves a remarkable 2.33× speed-up on off-the-shelf models while maintaining the output distribution. Codes will be available at https://github.com/MarkXCloud/CSpD</p>
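
            <p><strong>Editor's sketch (illustrative, not from the paper):</strong><br>
            A toy sketch of a continuous accept/reject step in speculative decoding, with 1-D Gaussians standing in for the draft and target token distributions. The paper's acceptance criterion and residual sampling for diffusion outputs are more involved; everything here is a simplification.</p>

            <pre><code>
# Toy sketch of a continuous accept/reject step for speculative decoding, with
# 1-D Gaussians standing in for the draft and target token distributions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def speculative_step(draft_mu, draft_sigma, target_mu, target_sigma):
    x = rng.normal(draft_mu, draft_sigma)             # draft model proposes a token
    p_target = norm.pdf(x, target_mu, target_sigma)   # target density at proposal
    p_draft = norm.pdf(x, draft_mu, draft_sigma)      # draft density at proposal
    accept_prob = min(1.0, p_target / p_draft)
    if accept_prob >= rng.random():
        return x, True                                # keep the drafted token
    # On rejection, fall back to a fresh sample from the target distribution
    # (the paper samples from a corrected residual distribution instead).
    return rng.normal(target_mu, target_sigma), False

token, accepted = speculative_step(0.0, 1.0, 0.2, 0.9)
</code></pre>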
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:39:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/afaf6a3c/391e919d.mp3" length="21767759" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1357</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang</p>

            <p><strong>Title:</strong><br>
            Continuous Speculative Decoding for Autoregressive Image Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11925v1">http://arxiv.org/abs/2411.11925v1</a></p>

            <p><strong>Abstract:</strong><br>
            Continuous-valued Autoregressive (AR) image generation models have demonstrated notable superiority over their discrete-token counterparts, showcasing considerable reconstruction quality and higher generation fidelity. However, the computational demands of the autoregressive framework result in significant inference overhead. While speculative decoding has proven effective in accelerating Large Language Models (LLMs), their adaptation to continuous-valued visual autoregressive models remains unexplored. This work generalizes the speculative decoding algorithm from discrete tokens to continuous space. By analyzing the intrinsic properties of output distribution, we establish a tailored acceptance criterion for the diffusion distributions prevalent in such models. To overcome the inconsistency that occurred in speculative decoding output distributions, we introduce denoising trajectory alignment and token pre-filling methods. Additionally, we identify the hard-to-sample distribution in the rejection phase. To mitigate this issue, we propose a meticulous acceptance-rejection sampling method with a proper upper bound, thereby circumventing complex integration. Experimental results show that our continuous speculative decoding achieves a remarkable 2.33× speed-up on off-the-shelf models while maintaining the output distribution. Codes will be available at https://github.com/MarkXCloud/CSpD</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements</title>
      <itunes:episode>107</itunes:episode>
      <podcast:episode>107</podcast:episode>
      <itunes:title>ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4c953533-9e99-4853-84c5-abc8d63e37ea</guid>
      <link>https://share.transistor.fm/s/9907d73c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin</p>

            <p><strong>Title:</strong><br>
            ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12044v1">http://arxiv.org/abs/2411.12044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin</p>

            <p><strong>Title:</strong><br>
            ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12044v1">http://arxiv.org/abs/2411.12044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:38:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9907d73c/ee8add4b.mp3" length="18559958" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1156</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            M. Arda Aydın, Efe Mert Çırpar, Elvin Abdinli, Gozde Unal, Yusuf H. Sahin</p>

            <p><strong>Title:</strong><br>
            ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12044v1">http://arxiv.org/abs/2411.12044v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations</title>
      <itunes:episode>106</itunes:episode>
      <podcast:episode>106</podcast:episode>
      <itunes:title>FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2d1cffad-2210-4efc-93fb-7852986941b3</guid>
      <link>https://share.transistor.fm/s/d951c65f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hmrishav Bandyopadhyay, Yi-Zhe Song</p>

            <p><strong>Title:</strong><br>
            FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10818v1">http://arxiv.org/abs/2411.10818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation -- just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hmrishav Bandyopadhyay, Yi-Zhe Song</p>

            <p><strong>Title:</strong><br>
            FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10818v1">http://arxiv.org/abs/2411.10818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation -- just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:38:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d951c65f/20fb63c3.mp3" length="24728166" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1542</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.GR, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Hmrishav Bandyopadhyay, Yi-Zhe Song</p>

            <p><strong>Title:</strong><br>
            FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10818v1">http://arxiv.org/abs/2411.10818v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation -- just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Soft Robotic Dynamic In-Hand Pen Spinning</title>
      <itunes:episode>105</itunes:episode>
      <podcast:episode>105</podcast:episode>
      <itunes:title>Soft Robotic Dynamic In-Hand Pen Spinning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">26ab7f45-4d6e-4150-b231-5a297471b02d</guid>
      <link>https://share.transistor.fm/s/81f40704</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yunchao Yao, Uksang Yoo, Jean Oh, Christopher G. Atkeson, Jeffrey Ichnowski</p>

            <p><strong>Title:</strong><br>
            Soft Robotic Dynamic In-Hand Pen Spinning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12734v1">http://arxiv.org/abs/2411.12734v1</a></p>

            <p><strong>Abstract:</strong><br>
Dynamic in-hand manipulation remains a challenging task for soft robotic systems, which have demonstrated advantages in safe, compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasi-static actions, and precise object models, the proposed system learns to spin a pen through trial and error using only real-world data, without requiring explicit prior knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin a pen robustly and reliably. After 130 sampled actions per object, SWIFT achieves a 100% success rate across three pens with different weights and weight distributions, demonstrating the system's generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks, including rapid in-hand manipulation. We also demonstrate that SWIFT generalizes to spinning items with different shapes and weights, such as a brush and a screwdriver, which we spin with 10/10 and 5/10 success rates respectively. Videos, data, and code are available at https://soft-spin.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yunchao Yao, Uksang Yoo, Jean Oh, Christopher G. Atkeson, Jeffrey Ichnowski</p>

            <p><strong>Title:</strong><br>
            Soft Robotic Dynamic In-Hand Pen Spinning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12734v1">http://arxiv.org/abs/2411.12734v1</a></p>

            <p><strong>Abstract:</strong><br>
Dynamic in-hand manipulation remains a challenging task for soft robotic systems, which have demonstrated advantages in safe, compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasi-static actions, and precise object models, the proposed system learns to spin a pen through trial and error using only real-world data, without requiring explicit prior knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin a pen robustly and reliably. After 130 sampled actions per object, SWIFT achieves a 100% success rate across three pens with different weights and weight distributions, demonstrating the system's generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks, including rapid in-hand manipulation. We also demonstrate that SWIFT generalizes to spinning items with different shapes and weights, such as a brush and a screwdriver, which we spin with 10/10 and 5/10 success rates respectively. Videos, data, and code are available at https://soft-spin.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:38:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/81f40704/cf0c7462.mp3" length="21704204" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1353</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.RO</p>

            <p><strong>Authors:</strong><br>
            Yunchao Yao, Uksang Yoo, Jean Oh, Christopher G. Atkeson, Jeffrey Ichnowski</p>

            <p><strong>Title:</strong><br>
            Soft Robotic Dynamic In-Hand Pen Spinning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12734v1">http://arxiv.org/abs/2411.12734v1</a></p>

            <p><strong>Abstract:</strong><br>
Dynamic in-hand manipulation remains a challenging task for soft robotic systems, which have demonstrated advantages in safe, compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SWIFT, a system for learning dynamic tasks using a soft and compliant robotic hand. Unlike previous works that rely on simulation, quasi-static actions, and precise object models, the proposed system learns to spin a pen through trial and error using only real-world data, without requiring explicit prior knowledge of the pen's physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin a pen robustly and reliably. After 130 sampled actions per object, SWIFT achieves a 100% success rate across three pens with different weights and weight distributions, demonstrating the system's generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks, including rapid in-hand manipulation. We also demonstrate that SWIFT generalizes to spinning items with different shapes and weights, such as a brush and a screwdriver, which we spin with 10/10 and 5/10 success rates respectively. Videos, data, and code are available at https://soft-spin.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Building Trust: Foundations of Security, Safety and Transparency in AI</title>
      <itunes:episode>104</itunes:episode>
      <podcast:episode>104</podcast:episode>
      <itunes:title>Building Trust: Foundations of Security, Safety and Transparency in AI</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4576aa75-c29d-4c0a-b1f3-14e2695412f6</guid>
      <link>https://share.transistor.fm/s/2f47bd55</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CY, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huzaifa Sidhpurwala, Garth Mollett, Emily Fox, Mark Bestavros, Huamin Chen</p>

            <p><strong>Title:</strong><br>
            Building Trust: Foundations of Security, Safety and Transparency in AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12275v1">http://arxiv.org/abs/2411.12275v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper explores the rapidly evolving ecosystem of publicly available AI models and their potential implications for the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CY, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huzaifa Sidhpurwala, Garth Mollett, Emily Fox, Mark Bestavros, Huamin Chen</p>

            <p><strong>Title:</strong><br>
            Building Trust: Foundations of Security, Safety and Transparency in AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12275v1">http://arxiv.org/abs/2411.12275v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper explores the rapidly evolving ecosystem of publicly available AI models and their potential implications for the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:37:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2f47bd55/1f7bcf23.mp3" length="21338518" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1330</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CY, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Huzaifa Sidhpurwala, Garth Mollett, Emily Fox, Mark Bestavros, Huamin Chen</p>

            <p><strong>Title:</strong><br>
            Building Trust: Foundations of Security, Safety and Transparency in AI</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12275v1">http://arxiv.org/abs/2411.12275v1</a></p>

            <p><strong>Abstract:</strong><br>
This paper explores the rapidly evolving ecosystem of publicly available AI models and their potential implications for the security and safety landscape. As AI models become increasingly prevalent, understanding their potential risks and vulnerabilities is crucial. We review the current security and safety scenarios while highlighting challenges such as tracking issues, remediation, and the apparent absence of AI model lifecycle and ownership processes. Comprehensive strategies to enhance security and safety for both model developers and end-users are proposed. This paper aims to provide some of the foundational pieces for more standardized security, safety, and transparency in the development and operation of AI models and the larger open ecosystems and communities forming around them.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning</title>
      <itunes:episode>103</itunes:episode>
      <podcast:episode>103</podcast:episode>
      <itunes:title>SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7cbad4d7-73f7-4936-afc4-13c8ea82eb32</guid>
      <link>https://share.transistor.fm/s/388e6f95</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu</p>

            <p><strong>Title:</strong><br>
            SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10161v1">http://arxiv.org/abs/2411.10161v1</a></p>

            <p><strong>Abstract:</strong><br>
Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for the overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can SEe and Assess ROIs quality with GUidance from a Large vision-Language model. SEAGULL incorporates a vision-language model (VLM), masks generated by the Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 100w synthetic distortion images with 33 million ROIs for pre-training to improve the model's ability to perceive regional quality, and SEAGULL-3k contains about 3k authentic distortion ROIs to enhance the model's ability to perceive real-world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. Code and datasets are publicly available at https://github.com/chencn2020/Seagull.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu</p>

            <p><strong>Title:</strong><br>
            SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10161v1">http://arxiv.org/abs/2411.10161v1</a></p>

            <p><strong>Abstract:</strong><br>
Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for the overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can SEe and Assess ROIs quality with GUidance from a Large vision-Language model. SEAGULL incorporates a vision-language model (VLM), masks generated by the Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 100w synthetic distortion images with 33 million ROIs for pre-training to improve the model's ability to perceive regional quality, and SEAGULL-3k contains about 3k authentic distortion ROIs to enhance the model's ability to perceive real-world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. Code and datasets are publicly available at https://github.com/chencn2020/Seagull.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:37:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/388e6f95/8a571512.mp3" length="19829307" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1236</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu</p>

            <p><strong>Title:</strong><br>
            SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10161v1">http://arxiv.org/abs/2411.10161v1</a></p>

            <p><strong>Abstract:</strong><br>
Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for the overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and is crucial for scenarios focusing on region-level quality. This paper proposes a novel network, SEAGULL, which can SEe and Assess ROIs quality with GUidance from a Large vision-Language model. SEAGULL incorporates a vision-language model (VLM), masks generated by the Segment Anything Model (SAM) to specify ROIs, and a meticulously designed Mask-based Feature Extractor (MFE) to extract global and local tokens for specified ROIs, enabling accurate fine-grained IQA for ROIs. Moreover, this paper constructs two ROI-based IQA datasets, SEAGULL-100w and SEAGULL-3k, for training and evaluating ROI-based IQA. SEAGULL-100w comprises about 100w synthetic distortion images with 33 million ROIs for pre-training to improve the model's ability to perceive regional quality, and SEAGULL-3k contains about 3k authentic distortion ROIs to enhance the model's ability to perceive real-world distortions. After pre-training on SEAGULL-100w and fine-tuning on SEAGULL-3k, SEAGULL shows remarkable performance on fine-grained ROI quality assessment. Code and datasets are publicly available at https://github.com/chencn2020/Seagull.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages</title>
      <itunes:episode>102</itunes:episode>
      <podcast:episode>102</podcast:episode>
      <itunes:title>Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fb3224d3-4d3f-4b91-8a3a-c1cb85192f12</guid>
      <link>https://share.transistor.fm/s/0b862bd7</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            S. Tamang, D. J. Bora</p>

            <p><strong>Title:</strong><br>
            Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12240v1">http://arxiv.org/abs/2411.12240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            S. Tamang, D. J. Bora</p>

            <p><strong>Title:</strong><br>
            Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12240v1">http://arxiv.org/abs/2411.12240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 20 Nov 2024 19:37:02 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0b862bd7/bc7f4756.mp3" length="23298767" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1452</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            S. Tamang, D. J. Bora</p>

            <p><strong>Title:</strong><br>
            Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.12240v1">http://arxiv.org/abs/2411.12240v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Generative World Explorer</title>
      <itunes:episode>101</itunes:episode>
      <podcast:episode>101</podcast:episode>
      <itunes:title>Generative World Explorer</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a2dfc3d1-a437-4daf-b023-84b63f9122ab</guid>
      <link>https://share.transistor.fm/s/c9cf5177</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            Generative World Explorer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11844v2">http://arxiv.org/abs/2411.11844v2</a></p>

            <p><strong>Abstract:</strong><br>
            Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            Generative World Explorer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11844v2">http://arxiv.org/abs/2411.11844v2</a></p>

            <p><strong>Abstract:</strong><br>
            Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:46:23 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c9cf5177/29f59b49.mp3" length="20776736" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1295</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 38 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Taiming Lu, Tianmin Shu, Alan Yuille, Daniel Khashabi, Jieneng Chen</p>

            <p><strong>Title:</strong><br>
            Generative World Explorer</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11844v2">http://arxiv.org/abs/2411.11844v2</a></p>

            <p><strong>Abstract:</strong><br>
            Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can $\textit{imagine}$ unseen parts of the world through a mental exploration and $\textit{revise}$ their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the $\textit{Generative World Explorer (Genex)}$, an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train $\textit{Genex}$, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) $\textit{Genex}$ can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices</title>
      <itunes:episode>100</itunes:episode>
      <podcast:episode>100</podcast:episode>
      <itunes:title>BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">31664b1c-cd0e-4155-b2e8-d7a9f840d80a</guid>
      <link>https://share.transistor.fm/s/1217e5f6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10640v1">http://arxiv.org/abs/2411.10640v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10640v1">http://arxiv.org/abs/2411.10640v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:46:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1217e5f6/90a798ed.mp3" length="19109152" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1191</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li</p>

            <p><strong>Title:</strong><br>
            BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10640v1">http://arxiv.org/abs/2411.10640v1</a></p>

            <p><strong>Abstract:</strong><br>
            The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with $\leq$ 4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering</title>
      <itunes:episode>99</itunes:episode>
      <podcast:episode>99</podcast:episode>
      <itunes:title>Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ea22fbfd-0f18-4bff-aa89-ec3b7ca5872b</guid>
      <link>https://share.transistor.fm/s/4b39425a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin</p>

            <p><strong>Title:</strong><br>
            Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11504v1">http://arxiv.org/abs/2411.11504v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin</p>

            <p><strong>Title:</strong><br>
            Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11504v1">http://arxiv.org/abs/2411.11504v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:45:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4b39425a/1d090abb.mp3" length="20344661" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1268</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL, stat.ML</p>

            <p><strong>Authors:</strong><br>
            Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, Xianpei Han, Le Sun, Jie Lou, Bowen Yu, Yaojie Lu, Hongyu Lin</p>

            <p><strong>Title:</strong><br>
            Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11504v1">http://arxiv.org/abs/2411.11504v1</a></p>

            <p><strong>Abstract:</strong><br>
            The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AnimateAnything: Consistent and Controllable Animation for Video Generation</title>
      <itunes:episode>98</itunes:episode>
      <podcast:episode>98</podcast:episode>
      <itunes:title>AnimateAnything: Consistent and Controllable Animation for Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3ea398e8-fe86-4f49-a50f-d67a38666da2</guid>
      <link>https://share.transistor.fm/s/94672d96</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu</p>

            <p><strong>Title:</strong><br>
            AnimateAnything: Consistent and Controllable Animation for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10836v1">http://arxiv.org/abs/2411.10836v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into frame-by-frame optical flows. Then we incorporate the optical flows as motion priors to guide final video generation. In addition, to reduce the flickering issues caused by large-scale motion, we propose a frequency-based stabilization module. It can enhance temporal coherence by ensuring the video's frequency domain consistency. Experiments demonstrate that our method outperforms the state-of-the-art approaches. For more details and videos, please refer to the webpage: https://yu-shaonian.github.io/Animate_Anything/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu</p>

            <p><strong>Title:</strong><br>
            AnimateAnything: Consistent and Controllable Animation for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10836v1">http://arxiv.org/abs/2411.10836v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into frame-by-frame optical flows. Then we incorporate the optical flows as motion priors to guide final video generation. In addition, to reduce the flickering issues caused by large-scale motion, we propose a frequency-based stabilization module. It can enhance temporal coherence by ensuring the video's frequency domain consistency. Experiments demonstrate that our method outperforms the state-of-the-art approaches. For more details and videos, please refer to the webpage: https://yu-shaonian.github.io/Animate_Anything/.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:45:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94672d96/f3bb7217.mp3" length="21425876" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1335</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu</p>

            <p><strong>Title:</strong><br>
            AnimateAnything: Consistent and Controllable Animation for Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10836v1">http://arxiv.org/abs/2411.10836v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into frame-by-frame optical flows. Then we incorporate the optical flows as motion priors to guide final video generation. In addition, to reduce the flickering issues caused by large-scale motion, we propose a frequency-based stabilization module. It can enhance temporal coherence by ensuring the video's frequency domain consistency. Experiments demonstrate that our method outperforms the state-of-the-art approaches. For more details and videos, please refer to the webpage: https://yu-shaonian.github.io/Animate_Anything/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Top-$nσ$: Not All Logits Are You Need</title>
      <itunes:episode>97</itunes:episode>
      <podcast:episode>97</podcast:episode>
      <itunes:title>Top-$nσ$: Not All Logits Are You Need</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">cc41e0c1-f663-49f9-b278-baae5b09272b</guid>
      <link>https://share.transistor.fm/s/96203aa2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang</p>

            <p><strong>Title:</strong><br>
            Top-$nσ$: Not All Logits Are You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07641v1">http://arxiv.org/abs/2411.07641v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-nσ, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-nσ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.</p>
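
            <p><strong>Code sketch:</strong><br>
            A minimal numpy sketch of a top-nσ-style cutoff on raw logits: keep only tokens whose logit lies within n standard deviations of the maximum, then sample from the renormalized distribution. The exact statistics and temperature handling in the paper may differ, and the function and parameter names here are illustrative.</p>

            <pre><code>import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Sample a token id after a top-n-sigma style cutoff on raw logits."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    threshold = logits.max() - n * logits.std()   # cutoff around the "informative" region
    mask = logits >= threshold
    scaled = logits / temperature
    scaled[~mask] = -np.inf                       # discard the Gaussian "noise" region
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Even at high temperature the cutoff keeps sampling inside the strong logits.
logits = np.concatenate([np.random.normal(0, 1, 50_000), [12.0, 11.5, 11.0]])
print(top_n_sigma_sample(logits, n=1.0, temperature=2.0))
</code></pre>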
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang</p>

            <p><strong>Title:</strong><br>
            Top-$nσ$: Not All Logits Are You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07641v1">http://arxiv.org/abs/2411.07641v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-nσ, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-nσ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:44:56 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/96203aa2/11305a8e.mp3" length="20513054" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1278</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang</p>

            <p><strong>Title:</strong><br>
            Top-$nσ$: Not All Logits Are You Need</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07641v1">http://arxiv.org/abs/2411.07641v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-nσ, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-nσ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Drowning in Documents: Consequences of Scaling Reranker Inference</title>
      <itunes:episode>96</itunes:episode>
      <podcast:episode>96</podcast:episode>
      <itunes:title>Drowning in Documents: Consequences of Scaling Reranker Inference</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c81aab63-2b22-40a0-ab03-2204d1cc7df8</guid>
      <link>https://share.transistor.fm/s/52078923</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov</p>

            <p><strong>Title:</strong><br>
            Drowning in Documents: Consequences of Scaling Reranker Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11767v1">http://arxiv.org/abs/2411.11767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. Our experiments reveal a surprising trend: the best existing rerankers provide diminishing returns when scoring progressively more documents and actually degrade quality beyond a certain limit. In fact, in this setting, rerankers can frequently assign high scores to documents with no lexical or semantic overlap with the query. We hope that our findings will spur future research to improve reranking.</p>
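
            <p><strong>Code sketch:</strong><br>
            The finding above comes from letting a reranker re-score progressively deeper candidate pools. A schematic Python sketch of that setup follows; ranked_docs stands in for first-stage retrieval output and rerank_score for a cross-encoder scorer, both illustrative assumptions rather than the paper's code.</p>

            <pre><code>def rerank_at_depths(query, ranked_docs, rerank_score, depths=(10, 100, 1000)):
    """For each depth k, let the reranker pick its top document from the top-k pool.

    Comparing the winners across depths shows whether giving the reranker more
    documents keeps helping or starts to hurt.
    """
    winners = {}
    for k in depths:
        pool = ranked_docs[:k]
        winners[k] = max(pool, key=lambda doc: rerank_score(query, doc))
    return winners

# Toy stand-ins: documents are strings, scored by word overlap with the query.
docs = [f"doc {i} about scaling reranker inference" for i in range(1000)]
overlap = lambda q, d: len(set(q.split()).intersection(d.split()))
print(rerank_at_depths("scaling reranker inference", docs, overlap))
</code></pre>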
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov</p>

            <p><strong>Title:</strong><br>
            Drowning in Documents: Consequences of Scaling Reranker Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11767v1">http://arxiv.org/abs/2411.11767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. Our experiments reveal a surprising trend: the best existing rerankers provide diminishing returns when scoring progressively more documents and actually degrade quality beyond a certain limit. In fact, in this setting, rerankers can frequently assign high scores to documents with no lexical or semantic overlap with the query. We hope that our findings will spur future research to improve reranking.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:44:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/52078923/3e5c761a.mp3" length="20871234" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1301</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.IR, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov</p>

            <p><strong>Title:</strong><br>
            Drowning in Documents: Consequences of Scaling Reranker Inference</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11767v1">http://arxiv.org/abs/2411.11767v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. Our experiments reveal a surprising trend: the best existing rerankers provide diminishing returns when scoring progressively more documents and actually degrade quality beyond a certain limit. In fact, in this setting, rerankers can frequently assign high scores to documents with no lexical or semantic overlap with the query. We hope that our findings will spur future research to improve reranking.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SlimLM: An Efficient Small Language Model for On-Device Document Assistance</title>
      <itunes:episode>95</itunes:episode>
      <podcast:episode>95</podcast:episode>
      <itunes:title>SlimLM: An Efficient Small Language Model for On-Device Document Assistance</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4ad3c2cd-0a7a-45ae-80c1-569003894e97</guid>
      <link>https://share.transistor.fm/s/118a010d</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Trung Bui</p>

            <p><strong>Title:</strong><br>
            SlimLM: An Efficient Small Language Model for On-Device Document Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09944v1">http://arxiv.org/abs/2411.09944v1</a></p>

            <p><strong>Abstract:</strong><br>
            While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Trung Bui</p>

            <p><strong>Title:</strong><br>
            SlimLM: An Efficient Small Language Model for On-Device Document Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09944v1">http://arxiv.org/abs/2411.09944v1</a></p>

            <p><strong>Abstract:</strong><br>
            While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:44:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/118a010d/7bf15cae.mp3" length="24912073" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1553</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Thang M. Pham, Phat T. Nguyen, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Trung Bui</p>

            <p><strong>Title:</strong><br>
            SlimLM: An Efficient Small Language Model for On-Device Document Assistance</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09944v1">http://arxiv.org/abs/2411.09944v1</a></p>

            <p><strong>Abstract:</strong><br>
            While small language models (SLMs) show promise for mobile deployment, their real-world performance and applications on smartphones remain underexplored. We present SlimLM, a series of SLMs optimized for document assistance tasks on mobile devices. Through extensive experiments on a Samsung Galaxy S24, we identify the optimal trade-offs between model size (ranging from 125M to 7B parameters), context length, and inference time for efficient on-device processing. SlimLM is pre-trained on SlimPajama-627B and fine-tuned on DocAssist, our constructed dataset for summarization, question answering and suggestion tasks. Our smallest model demonstrates efficient performance on the S24, while larger variants offer enhanced capabilities within mobile constraints. We evaluate SlimLM against existing SLMs, showing comparable or superior performance and offering a benchmark for future research in on-device language models. We also provide an Android application, offering practical insights into SLM deployment. Our findings provide valuable insights and illuminate the capabilities of running advanced language models on high-end smartphones, potentially reducing server costs and enhancing privacy through on-device processing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts</title>
      <itunes:episode>94</itunes:episode>
      <podcast:episode>94</podcast:episode>
      <itunes:title>Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">23f2348d-00b9-4ec4-9e78-b56d6ea73ddd</guid>
      <link>https://share.transistor.fm/s/aed4bc72</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu</p>

            <p><strong>Title:</strong><br>
            Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10669v1">http://arxiv.org/abs/2411.10669v1</a></p>

            <p><strong>Abstract:</strong><br>
            As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our Project Page: https://github.com/MetabrainAGI/Awaker.</p>
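
            <p><strong>Code sketch:</strong><br>
            A minimal PyTorch sketch of the general pattern named above: a frozen linear layer plus several low-rank (LoRA) experts, with a router sparsely activating one expert per input. The expert count, rank, and hard gating are illustrative assumptions, not the paper's configuration.</p>

            <pre><code>import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    """Frozen base layer with a sparsely gated mixture of LoRA experts."""

    def __init__(self, base, num_experts=4, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep the backbone frozen
        d_in, d_out = base.in_features, base.out_features
        self.router = nn.Linear(d_in, num_experts)       # picks one expert per input
        self.down = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, d_out))

    def forward(self, x):                                # x: (batch, d_in)
        gate = self.router(x).argmax(dim=-1)             # sparse routing decision
        delta = torch.einsum("bi,bir->br", x, self.down[gate])
        delta = torch.einsum("br,bro->bo", delta, self.up[gate])
        return self.base(x) + delta

layer = LoRAMoE(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)                   # torch.Size([2, 64])
</code></pre>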
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu</p>

            <p><strong>Title:</strong><br>
            Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10669v1">http://arxiv.org/abs/2411.10669v1</a></p>

            <p><strong>Abstract:</strong><br>
            As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our Project Page: https://github.com/MetabrainAGI/Awaker.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:43:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/aed4bc72/37bb1d81.mp3" length="19072768" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1188</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Jinqiang Long, Yanqi Dai, Guoxing Yang, Hongpeng Lin, Nanyi Fei, Yizhao Gao, Zhiwu Lu</p>

            <p><strong>Title:</strong><br>
            Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10669v1">http://arxiv.org/abs/2411.10669v1</a></p>

            <p><strong>Abstract:</strong><br>
            As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our Project Page: https://github.com/MetabrainAGI/Awaker.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers</title>
      <itunes:episode>93</itunes:episode>
      <podcast:episode>93</podcast:episode>
      <itunes:title>SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9a08b23a-e64e-40d4-83e1-ad01861fc847</guid>
      <link>https://share.transistor.fm/s/80294936</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana</p>

            <p><strong>Title:</strong><br>
            SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10510v1">http://arxiv.org/abs/2411.10510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.</p>
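
            <p><strong>Code sketch:</strong><br>
            A toy PyTorch sketch of the caching mechanism described above: wrap an expensive block and reuse its previous-timestep output on timesteps where a precomputed schedule, standing in for the calibration analysis, says the output is expected to change little. SmoothCacheLayer and reuse_schedule are illustrative names, not the released API.</p>

            <pre><code>import torch

class SmoothCacheLayer(torch.nn.Module):
    """Reuse a block's cached output on timesteps marked as safe to skip."""

    def __init__(self, layer, reuse_schedule):
        super().__init__()
        self.layer = layer                    # expensive block (e.g. attention or FFN)
        self.reuse_schedule = reuse_schedule  # timestep -> True means reuse the cache
        self._cached = None

    def forward(self, x, t):
        if self._cached is not None and self.reuse_schedule.get(t, False):
            return self._cached               # skip the expensive computation
        out = self.layer(x)
        self._cached = out.detach()
        return out

# Usage: wrap a block and reuse its output on every other denoising timestep.
block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
cached = SmoothCacheLayer(block, reuse_schedule={t: t % 2 == 1 for t in range(50)})
x = torch.randn(1, 16)
for t in reversed(range(50)):                 # diffusion samplers step t from high to low
    x = cached(x, t)
print(x.shape)                                # torch.Size([1, 16])
</code></pre>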
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana</p>

            <p><strong>Title:</strong><br>
            SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10510v1">http://arxiv.org/abs/2411.10510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:43:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/80294936/486897b4.mp3" length="26471906" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1651</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG</p>

            <p><strong>Authors:</strong><br>
            Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana</p>

            <p><strong>Title:</strong><br>
            SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10510v1">http://arxiv.org/abs/2411.10510v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion Transformers (DiT) have emerged as powerful generative models for various tasks, including image, video, and speech synthesis. However, their inference process remains computationally expensive due to the repeated evaluation of resource-intensive attention and feed-forward modules. To address this, we introduce SmoothCache, a model-agnostic inference acceleration technique for DiT architectures. SmoothCache leverages the observed high similarity between layer outputs across adjacent diffusion timesteps. By analyzing layer-wise representation errors from a small calibration set, SmoothCache adaptively caches and reuses key features during inference. Our experiments demonstrate that SmoothCache achieves 8% to 71% speed up while maintaining or even improving generation quality across diverse modalities. We showcase its effectiveness on DiT-XL for image generation, Open-Sora for text-to-video, and Stable Audio Open for text-to-audio, highlighting its potential to enable real-time applications and broaden the accessibility of powerful DiT models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLäMmlein: Compact and Competitive German-Only Language Models from Scratch</title>
      <itunes:episode>92</itunes:episode>
      <podcast:episode>92</podcast:episode>
      <itunes:title>LLäMmlein: Compact and Competitive German-Only Language Models from Scratch</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">74354cb5-b2d6-433e-bd1c-ea5518a6ee15</guid>
      <link>https://share.transistor.fm/s/cec33fa9</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jan Pfister, Julia Wunderle, Andreas Hotho</p>

            <p><strong>Title:</strong><br>
            LLäMmlein: Compact and Competitive German-Only Language Models from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11171v1">http://arxiv.org/abs/2411.11171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jan Pfister, Julia Wunderle, Andreas Hotho</p>

            <p><strong>Title:</strong><br>
            LLäMmlein: Compact and Competitive German-Only Language Models from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11171v1">http://arxiv.org/abs/2411.11171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 19 Nov 2024 19:43:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/cec33fa9/ce07b548.mp3" length="21382826" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1333</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Jan Pfister, Julia Wunderle, Andreas Hotho</p>

            <p><strong>Title:</strong><br>
            LLäMmlein: Compact and Competitive German-Only Language Models from Scratch</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.11171v1">http://arxiv.org/abs/2411.11171v1</a></p>

            <p><strong>Abstract:</strong><br>
            We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaVA-o1: Let Vision Language Models Reason Step-by-Step</title>
      <itunes:episode>91</itunes:episode>
      <podcast:episode>91</podcast:episode>
      <itunes:title>LLaVA-o1: Let Vision Language Models Reason Step-by-Step</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8f13c12e-b529-406d-b359-c3cd83bb977b</guid>
      <link>https://share.transistor.fm/s/0ae9cef2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 64 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan</p>

            <p><strong>Title:</strong><br>
            LLaVA-o1: Let Vision Language Models Reason Step-by-Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10440v1">http://arxiv.org/abs/2411.10440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.</p>
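
            <p><strong>Code sketch:</strong><br>
            The inference-time stage-level beam search can be illustrated in a few lines: at each reasoning stage, sample several candidate continuations and keep only the best-scoring one before moving to the next stage. generate and score are placeholders for model calls, and the stage names and candidate count are assumptions for illustration.</p>

            <pre><code>import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def stage_level_beam_search(generate, score, n_candidates=3):
    """Keep the best of n sampled continuations at every reasoning stage.

    Search happens per stage rather than per token: each stage is expanded into
    several candidates, and only the highest-scoring candidate survives.
    """
    prefix = ""
    for stage in STAGES:
        candidates = [generate(prefix, stage) for _ in range(n_candidates)]
        prefix += max(candidates, key=lambda cand: score(prefix, cand))
    return prefix

# Toy stand-ins: random "generations" scored by their length.
result = stage_level_beam_search(
    generate=lambda prefix, stage: f"[{stage}] " + "x" * random.randint(1, 5) + " ",
    score=lambda prefix, cand: len(cand),
)
print(result)
</code></pre>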
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 64 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan</p>

            <p><strong>Title:</strong><br>
            LLaVA-o1: Let Vision Language Models Reason Step-by-Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10440v1">http://arxiv.org/abs/2411.10440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Nov 2024 19:02:24 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0ae9cef2/dcffcae7.mp3" length="24625334" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1535</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 64 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan</p>

            <p><strong>Title:</strong><br>
            LLaVA-o1: Let Vision Language Models Reason Step-by-Step</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10440v1">http://arxiv.org/abs/2411.10440v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation</title>
      <itunes:episode>90</itunes:episode>
      <podcast:episode>90</podcast:episode>
      <itunes:title>GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c1276e3a-31f0-43e7-9066-7dc757c14a9e</guid>
      <link>https://share.transistor.fm/s/0c581bb9</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08033v1">http://arxiv.org/abs/2411.08033v1</a></p>

            <p><strong>Abstract:</strong><br>
            While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08033v1">http://arxiv.org/abs/2411.08033v1</a></p>

            <p><strong>Abstract:</strong><br>
            While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Nov 2024 19:01:52 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c581bb9/e2be226a.mp3" length="23564574" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1469</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy</p>

            <p><strong>Title:</strong><br>
            GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08033v1">http://arxiv.org/abs/2411.08033v1</a></p>

            <p><strong>Abstract:</strong><br>
            While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Xmodel-1.5: An 1B-scale Multilingual LLM</title>
      <itunes:episode>89</itunes:episode>
      <podcast:episode>89</podcast:episode>
      <itunes:title>Xmodel-1.5: An 1B-scale Multilingual LLM</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a3b99eb0-8378-4e2d-bcb6-a334853f2373</guid>
      <link>https://share.transistor.fm/s/b9f76d36</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-1.5: An 1B-scale Multilingual LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10083v1">http://arxiv.org/abs/2411.10083v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Xmodel-1.5, a novel 1-billion-parameter multilingual large model pretrained on approximately 2 trillion tokens. The model demonstrates strong performance across several languages, with particularly notable results in Thai, Arabic, and French, alongside its effectiveness in Chinese and English. In addition, we contribute to the research community by releasing a Thai evaluation dataset, which includes hundreds of questions annotated by students from Chulalongkorn University's School of Integrated Innovation. While the results are promising, we acknowledge that there is still room for improvement. We hope this work advances ongoing efforts in multilingual AI research and promotes better cross-linguistic understanding in various natural language processing tasks. Our models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-1.5: An 1B-scale Multilingual LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10083v1">http://arxiv.org/abs/2411.10083v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Xmodel-1.5, a novel 1-billion-parameter multilingual large model pretrained on approximately 2 trillion tokens. The model demonstrates strong performance across several languages, with particularly notable results in Thai, Arabic, and French, alongside its effectiveness in Chinese and English. In addition, we contribute to the research community by releasing a Thai evaluation dataset, which includes hundreds of questions annotated by students from Chulalongkorn University's School of Integrated Innovation. While the results are promising, we acknowledge that there is still room for improvement. We hope this work advances ongoing efforts in multilingual AI research and promotes better cross-linguistic understanding in various natural language processing tasks. Our models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 18 Nov 2024 19:01:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b9f76d36/0aaeb294.mp3" length="20095477" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1252</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling</p>

            <p><strong>Title:</strong><br>
            Xmodel-1.5: An 1B-scale Multilingual LLM</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.10083v1">http://arxiv.org/abs/2411.10083v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Xmodel-1.5, a novel 1-billion-parameter multilingual large model pretrained on approximately 2 trillion tokens. The model demonstrates strong performance across several languages, with particularly notable results in Thai, Arabic, and French, alongside its effectiveness in Chinese and English. In addition, we contribute to the research community by releasing a Thai evaluation dataset, which includes hundreds of questions annotated by students from Chulalongkorn University's School of Integrated Innovation. While the results are promising, we acknowledge that there is still room for improvement. We hope this work advances ongoing efforts in multilingual AI research and promotes better cross-linguistic understanding in various natural language processing tasks. Our models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models</title>
      <itunes:episode>88</itunes:episode>
      <podcast:episode>88</podcast:episode>
      <itunes:title>LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8688beca-44d1-4c71-9c96-fd5f321f0db2</guid>
      <link>https://share.transistor.fm/s/6668c555</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6</p>

            <p><strong>Authors:</strong><br>
            Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng</p>

            <p><strong>Title:</strong><br>
            LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09595v1">http://arxiv.org/abs/2411.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.</p>
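
            <p><strong>Code sketch:</strong><br>
            The core idea, representing vertex coordinates and face definitions as plain text an LLM can read and emit, can be shown in a few lines. This OBJ-style serializer is only a sketch of the concept; the coordinate quantization and exact text format used by LLaMA-Mesh may differ.</p>

            <pre><code>def mesh_to_text(vertices, faces, decimals=2):
    """Serialize a triangle mesh as OBJ-like plain text lines."""
    lines = [f"v {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}"
             for x, y, z in vertices]
    lines += [f"f {a} {b} {c}" for a, b, c in faces]   # faces use 1-indexed vertex ids
    return "\n".join(lines)

# A single triangle becomes a short text snippet that fits in an ordinary prompt.
print(mesh_to_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(1, 2, 3)]))
</code></pre>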
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6</p>

            <p><strong>Authors:</strong><br>
            Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng</p>

            <p><strong>Title:</strong><br>
            LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09595v1">http://arxiv.org/abs/2411.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:22:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6668c555/7d2a3922.mp3" length="22983595" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6</p>

            <p><strong>Authors:</strong><br>
            Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng</p>

            <p><strong>Title:</strong><br>
            LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09595v1">http://arxiv.org/abs/2411.09595v1</a></p>

            <p><strong>Abstract:</strong><br>
            This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MagicQuill: An Intelligent Interactive Image Editing System</title>
      <itunes:episode>87</itunes:episode>
      <podcast:episode>87</podcast:episode>
      <itunes:title>MagicQuill: An Intelligent Interactive Image Editing System</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9f5f643d-ef10-4ec3-b35e-f1b7dc242349</guid>
      <link>https://share.transistor.fm/s/77fe53eb</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen</p>

            <p><strong>Title:</strong><br>
            MagicQuill: An Intelligent Interactive Image Editing System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09703v1">http://arxiv.org/abs/2411.09703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit https://magic-quill.github.io to try out our system.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen</p>

            <p><strong>Title:</strong><br>
            MagicQuill: An Intelligent Interactive Image Editing System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09703v1">http://arxiv.org/abs/2411.09703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit https://magic-quill.github.io to try out our system.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:21:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/77fe53eb/f60a1b87.mp3" length="19471483" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1213</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 31 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen</p>

            <p><strong>Title:</strong><br>
            MagicQuill: An Intelligent Interactive Image Editing System</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09703v1">http://arxiv.org/abs/2411.09703v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit https://magic-quill.github.io to try out our system.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Cut Your Losses in Large-Vocabulary Language Models</title>
      <itunes:episode>86</itunes:episode>
      <podcast:episode>86</podcast:episode>
      <itunes:title>Cut Your Losses in Large-Vocabulary Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6e3afbe8-3926-4943-a33f-bb6aca1a6c27</guid>
      <link>https://share.transistor.fm/s/0b7bced0</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl</p>

            <p><strong>Title:</strong><br>
            Cut Your Losses in Large-Vocabulary Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09009v1">http://arxiv.org/abs/2411.09009v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl</p>

            <p><strong>Title:</strong><br>
            Cut Your Losses in Large-Vocabulary Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09009v1">http://arxiv.org/abs/2411.09009v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:21:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0b7bced0/c84a055b.mp3" length="20026524" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1248</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.LG, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, Philipp Krähenbühl</p>

            <p><strong>Title:</strong><br>
            Cut Your Losses in Large-Vocabulary Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.09009v1">http://arxiv.org/abs/2411.09009v1</a></p>

            <p><strong>Abstract:</strong><br>
            As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?</title>
      <itunes:episode>85</itunes:episode>
      <podcast:episode>85</podcast:episode>
      <itunes:title>ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">da6425ec-e657-48d1-8d48-b99ae488f5df</guid>
      <link>https://share.transistor.fm/s/6c47565f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu</p>

            <p><strong>Title:</strong><br>
            ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06469v1">http://arxiv.org/abs/2411.06469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu</p>

            <p><strong>Title:</strong><br>
            ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06469v1">http://arxiv.org/abs/2411.06469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:21:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6c47565f/bdd0e428.mp3" length="22837741" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1424</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 9 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Danielle Bitterman, Fei Wang, Kai Shu</p>

            <p><strong>Title:</strong><br>
            ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06469v1">http://arxiv.org/abs/2411.06469v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sharingan: Extract User Action Sequence from Desktop Recordings</title>
      <itunes:episode>84</itunes:episode>
      <podcast:episode>84</podcast:episode>
      <itunes:title>Sharingan: Extract User Action Sequence from Desktop Recordings</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">32f4be24-232c-4f70-b1f8-dac45166e945</guid>
      <link>https://share.transistor.fm/s/7db39082</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Sharingan: Extract User Action Sequence from Desktop Recordings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08768v1">http://arxiv.org/abs/2411.08768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Sharingan: Extract User Action Sequence from Desktop Recordings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08768v1">http://arxiv.org/abs/2411.08768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:20:45 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7db39082/62d7525f.mp3" length="21787816" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1358</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang</p>

            <p><strong>Title:</strong><br>
            Sharingan: Extract User Action Sequence from Desktop Recordings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08768v1">http://arxiv.org/abs/2411.08768v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video recordings of user activities, particularly desktop recordings, offer a rich source of data for understanding user behaviors and automating processes. However, despite advancements in Vision-Language Models (VLMs) and their increasing use in video analysis, extracting user actions from desktop recordings remains an underexplored area. This paper addresses this gap by proposing two novel VLM-based methods for user action extraction: the Direct Frame-Based Approach (DF), which inputs sampled frames directly into VLMs, and the Differential Frame-Based Approach (DiffF), which incorporates explicit frame differences detected via computer vision techniques. We evaluate these methods using a basic self-curated dataset and an advanced benchmark adapted from prior work. Our results show that the DF approach achieves an accuracy of 70% to 80% in identifying user actions, with the extracted action sequences being re-playable through Robotic Process Automation. We find that while VLMs show potential, incorporating explicit UI changes can degrade performance, making the DF approach more reliable. This work represents the first application of VLMs for extracting user action sequences from desktop recordings, contributing new methods, benchmarks, and insights for future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hermes: A Large Language Model Framework on the Journey to Autonomous Networks</title>
      <itunes:episode>83</itunes:episode>
      <podcast:episode>83</podcast:episode>
      <itunes:title>Hermes: A Large Language Model Framework on the Journey to Autonomous Networks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">9bcba98d-b0b1-42c1-bd4a-ea481426d8a2</guid>
      <link>https://share.transistor.fm/s/7762d530</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.AI, cs.NI</p>

            <p><strong>Authors:</strong><br>
            Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Hermes: A Large Language Model Framework on the Journey to Autonomous Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06490v1">http://arxiv.org/abs/2411.06490v1</a></p>

            <p><strong>Abstract:</strong><br>
            The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for modeling network behaviors and defining policies to meet target requirements. Network Digital Twins (NDTs) have shown promise in enhancing network intelligence, but the successful implementation of this technology is constrained by use case-specific architectures, limiting its role in advancing network autonomy. A more capable network intelligence, or "telecommunications brain", is needed to enable seamless, autonomous management of cellular networks. Large Language Models (LLMs) have emerged as potential enablers for this vision but face challenges in network modeling, especially in reasoning and handling diverse data types. To address these gaps, we introduce Hermes, a chain of LLM agents that uses "blueprints" for constructing NDT instances through structured and explainable logical steps. Hermes allows automatic, reliable, and accurate network modeling of diverse use cases and configurations, thus marking progress toward fully autonomous network operations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.AI, cs.NI</p>

            <p><strong>Authors:</strong><br>
            Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Hermes: A Large Language Model Framework on the Journey to Autonomous Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06490v1">http://arxiv.org/abs/2411.06490v1</a></p>

            <p><strong>Abstract:</strong><br>
            The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for modeling network behaviors and defining policies to meet target requirements. Network Digital Twins (NDTs) have shown promise in enhancing network intelligence, but the successful implementation of this technology is constrained by use case-specific architectures, limiting its role in advancing network autonomy. A more capable network intelligence, or "telecommunications brain", is needed to enable seamless, autonomous management of cellular networks. Large Language Models (LLMs) have emerged as potential enablers for this vision but face challenges in network modeling, especially in reasoning and handling diverse data types. To address these gaps, we introduce Hermes, a chain of LLM agents that uses "blueprints" for constructing NDT instances through structured and explainable logical steps. Hermes allows automatic, reliable, and accurate network modeling of diverse use cases and configurations, thus marking progress toward fully autonomous network operations.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:20:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7762d530/0eb6bef0.mp3" length="21603929" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1347</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.AI, cs.NI</p>

            <p><strong>Authors:</strong><br>
            Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico, Merouane Debbah, Zhi-Quan Luo</p>

            <p><strong>Title:</strong><br>
            Hermes: A Large Language Model Framework on the Journey to Autonomous Networks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06490v1">http://arxiv.org/abs/2411.06490v1</a></p>

            <p><strong>Abstract:</strong><br>
            The drive toward automating cellular network operations has grown with the increasing complexity of these systems. Despite advancements, full autonomy currently remains out of reach due to reliance on human intervention for modeling network behaviors and defining policies to meet target requirements. Network Digital Twins (NDTs) have shown promise in enhancing network intelligence, but the successful implementation of this technology is constrained by use case-specific architectures, limiting its role in advancing network autonomy. A more capable network intelligence, or "telecommunications brain", is needed to enable seamless, autonomous management of cellular networks. Large Language Models (LLMs) have emerged as potential enablers for this vision but face challenges in network modeling, especially in reasoning and handling diverse data types. To address these gaps, we introduce Hermes, a chain of LLM agents that uses "blueprints" for constructing NDT instances through structured and explainable logical steps. Hermes allows automatic, reliable, and accurate network modeling of diverse use cases and configurations, thus marking progress toward fully autonomous network operations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples</title>
      <itunes:episode>82</itunes:episode>
      <podcast:episode>82</podcast:episode>
      <itunes:title>Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">94c8e469-a043-4f61-8a6f-9b80c10273e1</guid>
      <link>https://share.transistor.fm/s/245ca19d</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Noël Vouitsis, Rasa Hosseinzadeh, Brendan Leigh Ross, Valentin Villecroze, Satya Krishna Gorti, Jesse C. Cresswell, Gabriel Loaiza-Ganem</p>

            <p><strong>Title:</strong><br>
            Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08954v1">http://arxiv.org/abs/2411.08954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver; rather, they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which <em>directly</em> minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: https://github.com/layer6ai-labs/direct-cms.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Noël Vouitsis, Rasa Hosseinzadeh, Brendan Leigh Ross, Valentin Villecroze, Satya Krishna Gorti, Jesse C. Cresswell, Gabriel Loaiza-Ganem</p>

            <p><strong>Title:</strong><br>
            Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08954v1">http://arxiv.org/abs/2411.08954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver; rather, they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which <em>directly</em> minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: https://github.com/layer6ai-labs/direct-cms.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 15 Nov 2024 19:19:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/245ca19d/01eddf9e.mp3" length="21317636" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Noël Vouitsis, Rasa Hosseinzadeh, Brendan Leigh Ross, Valentin Villecroze, Satya Krishna Gorti, Jesse C. Cresswell, Gabriel Loaiza-Ganem</p>

            <p><strong>Title:</strong><br>
            Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08954v1">http://arxiv.org/abs/2411.08954v1</a></p>

            <p><strong>Abstract:</strong><br>
            Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver; rather, they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which <em>directly</em> minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: https://github.com/layer6ai-labs/direct-cms.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Direct Preference Optimization Using Sparse Feature-Level Constraints</title>
      <itunes:episode>81</itunes:episode>
      <podcast:episode>81</podcast:episode>
      <itunes:title>Direct Preference Optimization Using Sparse Feature-Level Constraints</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b565f30d-7169-4575-9ba3-1e532b0ea98f</guid>
      <link>https://share.transistor.fm/s/3be49f2c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang</p>

            <p><strong>Title:</strong><br>
            Direct Preference Optimization Using Sparse Feature-Level Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07618v1">http://arxiv.org/abs/2411.07618v1</a></p>

            <p><strong>Abstract:</strong><br>
            The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and gains the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang</p>

            <p><strong>Title:</strong><br>
            Direct Preference Optimization Using Sparse Feature-Level Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07618v1">http://arxiv.org/abs/2411.07618v1</a></p>

            <p><strong>Abstract:</strong><br>
            The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and gains the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Nov 2024 22:20:37 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3be49f2c/6f97d28d.mp3" length="20464982" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang</p>

            <p><strong>Title:</strong><br>
            Direct Preference Optimization Using Sparse Feature-Level Constraints</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07618v1">http://arxiv.org/abs/2411.07618v1</a></p>

            <p><strong>Abstract:</strong><br>
            The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and gains the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CamemBERT 2.0: A Smarter French Language Model Aged to Perfection</title>
      <itunes:episode>80</itunes:episode>
      <podcast:episode>80</podcast:episode>
      <itunes:title>CamemBERT 2.0: A Smarter French Language Model Aged to Perfection</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c50eebe1-ac5c-496c-b0bf-fe09c2081663</guid>
      <link>https://share.transistor.fm/s/041af552</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah</p>

            <p><strong>Title:</strong><br>
            CamemBERT 2.0: A Smarter French Language Model Aged to Perfection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08868v1">http://arxiv.org/abs/2411.08868v1</a></p>

            <p><strong>Abstract:</strong><br>
            French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah</p>

            <p><strong>Title:</strong><br>
            CamemBERT 2.0: A Smarter French Language Model Aged to Perfection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08868v1">http://arxiv.org/abs/2411.08868v1</a></p>

            <p><strong>Abstract:</strong><br>
            French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Nov 2024 22:20:13 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/041af552/3b660511.mp3" length="23547844" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1468</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah</p>

            <p><strong>Title:</strong><br>
            CamemBERT 2.0: A Smarter French Language Model Aged to Perfection</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08868v1">http://arxiv.org/abs/2411.08868v1</a></p>

            <p><strong>Abstract:</strong><br>
            French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as medical field tasks, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Can sparse autoencoders be used to decompose and interpret steering vectors?</title>
      <itunes:episode>79</itunes:episode>
      <podcast:episode>79</podcast:episode>
      <itunes:title>Can sparse autoencoders be used to decompose and interpret steering vectors?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dbd1c225-7f31-4405-bde2-ad79efaa8132</guid>
      <link>https://share.transistor.fm/s/6ee7b4d3</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harry Mayne, Yushi Yang, Adam Mahdi</p>

            <p><strong>Title:</strong><br>
            Can sparse autoencoders be used to decompose and interpret steering vectors?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08790v1">http://arxiv.org/abs/2411.08790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harry Mayne, Yushi Yang, Adam Mahdi</p>

            <p><strong>Title:</strong><br>
            Can sparse autoencoders be used to decompose and interpret steering vectors?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08790v1">http://arxiv.org/abs/2411.08790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Nov 2024 22:19:50 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6ee7b4d3/9ce12110.mp3" length="21076881" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1314</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harry Mayne, Yushi Yang, Adam Mahdi</p>

            <p><strong>Title:</strong><br>
            Can sparse autoencoders be used to decompose and interpret steering vectors?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08790v1">http://arxiv.org/abs/2411.08790v1</a></p>

            <p><strong>Abstract:</strong><br>
            Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation</title>
      <itunes:episode>78</itunes:episode>
      <podcast:episode>78</podcast:episode>
      <itunes:title>PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fc9488a8-eef8-4105-8e27-a998472b9102</guid>
      <link>https://share.transistor.fm/s/0c9d1715</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.AI, cs.MM, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai</p>

            <p><strong>Title:</strong><br>
            PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08307v1">http://arxiv.org/abs/2411.08307v1</a></p>

            <p><strong>Abstract:</strong><br>
            Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.AI, cs.MM, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai</p>

            <p><strong>Title:</strong><br>
            PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08307v1">http://arxiv.org/abs/2411.08307v1</a></p>

            <p><strong>Abstract:</strong><br>
            Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 14 Nov 2024 22:19:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/0c9d1715/1c3d90ad.mp3" length="18281608" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1139</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.AI, cs.MM, cs.SD, eess.AS</p>

            <p><strong>Authors:</strong><br>
            Yungang Yi, Weihua Li, Matthew Kuo, Quan Bai</p>

            <p><strong>Title:</strong><br>
            PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08307v1">http://arxiv.org/abs/2411.08307v1</a></p>

            <p><strong>Abstract:</strong><br>
            Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a significant challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. The proposed model, evaluated on datasets like Maestro, demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SAMPart3D: Segment Any Part in 3D Objects</title>
      <itunes:episode>77</itunes:episode>
      <podcast:episode>77</podcast:episode>
      <itunes:title>SAMPart3D: Segment Any Part in 3D Objects</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ce433988-757d-47df-adf3-1d4408434079</guid>
      <link>https://share.transistor.fm/s/07d6acd8</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            SAMPart3D: Segment Any Part in 3D Objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07184v1">http://arxiv.org/abs/2411.07184v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            SAMPart3D: Segment Any Part in 3D Objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07184v1">http://arxiv.org/abs/2411.07184v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:39:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/07d6acd8/f264de6b.mp3" length="20080013" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1251</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, Xihui Liu</p>

            <p><strong>Title:</strong><br>
            SAMPart3D: Segment Any Part in 3D Objects</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07184v1">http://arxiv.org/abs/2411.07184v1</a></p>

            <p><strong>Abstract:</strong><br>
            3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</title>
      <itunes:episode>76</itunes:episode>
      <podcast:episode>76</podcast:episode>
      <itunes:title>JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ee75f4b2-c612-42f1-874a-d82f6c576dff</guid>
      <link>https://share.transistor.fm/s/2bd3a9f6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan</p>

            <p><strong>Title:</strong><br>
            JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07975v1">http://arxiv.org/abs/2411.07975v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan</p>

            <p><strong>Title:</strong><br>
            JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07975v1">http://arxiv.org/abs/2411.07975v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:39:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/2bd3a9f6/74533d87.mp3" length="21655368" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan</p>

            <p><strong>Title:</strong><br>
            JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07975v1">http://arxiv.org/abs/2411.07975v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Stronger Models are NOT Stronger Teachers for Instruction Tuning</title>
      <itunes:episode>75</itunes:episode>
      <podcast:episode>75</podcast:episode>
      <itunes:title>Stronger Models are NOT Stronger Teachers for Instruction Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7b7a8a08-1e9e-4ae4-aa66-7c5714e5ec73</guid>
      <link>https://share.transistor.fm/s/36390e0e</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Stronger Models are NOT Stronger Teachers for Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07133v2">http://arxiv.org/abs/2411.07133v2</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Stronger Models are NOT Stronger Teachers for Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07133v2">http://arxiv.org/abs/2411.07133v2</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:38:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/36390e0e/64aa8a81.mp3" length="26725587" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1667</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran</p>

            <p><strong>Title:</strong><br>
            Stronger Models are NOT Stronger Teachers for Instruction Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07133v2">http://arxiv.org/abs/2411.07133v2</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs with diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators for the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models' Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions</title>
      <itunes:episode>74</itunes:episode>
      <podcast:episode>74</podcast:episode>
      <itunes:title>BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">359e251b-8b35-490c-a18f-39e1d2073424</guid>
      <link>https://share.transistor.fm/s/c5e87e20</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07461v1">http://arxiv.org/abs/2411.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07461v1">http://arxiv.org/abs/2411.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:38:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5e87e20/8bb0b196.mp3" length="19890695" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1239</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu</p>

            <p><strong>Title:</strong><br>
            BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07461v1">http://arxiv.org/abs/2411.07461v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Scaling Properties of Diffusion Models for Perceptual Tasks</title>
      <itunes:episode>73</itunes:episode>
      <podcast:episode>73</podcast:episode>
      <itunes:title>Scaling Properties of Diffusion Models for Perceptual Tasks</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3c97075e-95f0-47bb-901c-84eef5904004</guid>
      <link>https://share.transistor.fm/s/450aefe7</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            Scaling Properties of Diffusion Models for Perceptual Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08034v2">http://arxiv.org/abs/2411.08034v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            Scaling Properties of Diffusion Models for Perceptual Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08034v2">http://arxiv.org/abs/2411.08034v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:38:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/450aefe7/b4c94498.mp3" length="24206960" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1509</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik</p>

            <p><strong>Title:</strong><br>
            Scaling Properties of Diffusion Models for Perceptual Tasks</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08034v2">http://arxiv.org/abs/2411.08034v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve competitive performance to state-of-the-art methods using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings</title>
      <itunes:episode>72</itunes:episode>
      <podcast:episode>72</podcast:episode>
      <itunes:title>Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2a396eae-7a8d-4890-a124-e403f0ca9314</guid>
      <link>https://share.transistor.fm/s/1861a69d</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani</p>

            <p><strong>Title:</strong><br>
            Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08017v1">http://arxiv.org/abs/2411.08017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required for effective generative modeling. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a 256³ signed distance field into a 12³ × 4 latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at 256³ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani</p>

            <p><strong>Title:</strong><br>
            Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08017v1">http://arxiv.org/abs/2411.08017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required for effective generative modeling. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a 256³ signed distance field into a 12³ × 4 latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at 256³ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 13 Nov 2024 19:37:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/1861a69d/c7fa8800.mp3" length="21532899" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani</p>

            <p><strong>Title:</strong><br>
            Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.08017v1">http://arxiv.org/abs/2411.08017v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required for effective generative modeling. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a 256³ signed distance field into a 12³ × 4 latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at 256³ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models</title>
      <itunes:episode>71</itunes:episode>
      <podcast:episode>71</podcast:episode>
      <itunes:title>Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">7d774e36-c3ba-4514-ab96-d94ce0f86dd7</guid>
      <link>https://share.transistor.fm/s/596edfcf</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 44 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07232v2">http://arxiv.org/abs/2411.07232v2</a></p>

            <p><strong>Abstract:</strong><br>
            Adding objects into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 44 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07232v2">http://arxiv.org/abs/2411.07232v2</a></p>

            <p><strong>Abstract:</strong><br>
            Adding objects into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:36:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/596edfcf/91be915c.mp3" length="22711524" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1416</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 44 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, Gal Chechik</p>

            <p><strong>Title:</strong><br>
            Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07232v2">http://arxiv.org/abs/2411.07232v2</a></p>

            <p><strong>Abstract:</strong><br>
            Adding objects into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision</title>
      <itunes:episode>70</itunes:episode>
      <podcast:episode>70</podcast:episode>
      <itunes:title>OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d943bf85-2eaf-4a05-aed2-86d59021b36f</guid>
      <link>https://share.transistor.fm/s/73e108ac</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07199v1">http://arxiv.org/abs/2411.07199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained on datasets with a high volume of noise and artifacts, a result of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contributions are four-fold: (1) OmniEdit is trained using supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (like GPT-4o), instead of CLIP-score, to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; and (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions covering different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset, and model will be available at https://tiger-ai-lab.github.io/OmniEdit/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07199v1">http://arxiv.org/abs/2411.07199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained on datasets with a high volume of noise and artifacts, due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting their versatility in handling real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contribution is fourfold: (1) OmniEdit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. (2) We utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality. (3) We propose a new editing architecture called EditNet to greatly boost the editing success rate. (4) We provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset, and model will be available at <a href="https://tiger-ai-lab.github.io/OmniEdit/">https://tiger-ai-lab.github.io/OmniEdit/</a></p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:36:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73e108ac/f15f1db4.mp3" length="19058143" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1187</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 39 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen</p>

            <p><strong>Title:</strong><br>
            OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07199v1">http://arxiv.org/abs/2411.07199v1</a></p>

            <p><strong>Abstract:</strong><br>
            Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained on datasets with a high volume of noise and artifacts, due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting their versatility in handling real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contribution is fourfold: (1) OmniEdit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. (2) We utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o) instead of CLIP-score to improve the data quality. (3) We propose a new editing architecture called EditNet to greatly boost the editing success rate. (4) We provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset, and model will be available at <a href="https://tiger-ai-lab.github.io/OmniEdit/">https://tiger-ai-lab.github.io/OmniEdit/</a></p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models</title>
      <itunes:episode>69</itunes:episode>
      <podcast:episode>69</podcast:episode>
      <itunes:title>Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">fedc6eaa-dd1e-4f78-8722-c3c5cb55c9ff</guid>
      <link>https://share.transistor.fm/s/c5d36b4a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07140v1">http://arxiv.org/abs/2411.07140v1</a></p>

            <p><strong>Abstract:</strong><br>
            New LLM evaluation benchmarks are needed to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language across 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to obtain high-quality questions and answers, where the reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and grading is easy to automate via the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07140v1">http://arxiv.org/abs/2411.07140v1</a></p>

            <p><strong>Abstract:</strong><br>
            New LLM evaluation benchmarks are needed to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language across 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to obtain high-quality questions and answers, where the reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and grading is easy to automate via the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:35:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c5d36b4a/645ee139.mp3" length="20459554" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1275</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 30 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yancheng He, Shilong Li, Jiaheng Liu, Yingshui Tan, Hui Huang, Weixun Wang, Xingyuan Bu, Hangyu Guo, Chengwei Hu, Boren Zheng, Xuepeng Liu, Dekai Sun, Wenbo Su, Bo Zheng</p>

            <p><strong>Title:</strong><br>
            Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07140v1">http://arxiv.org/abs/2411.07140v1</a></p>

            <p><strong>Abstract:</strong><br>
            New LLM evaluation benchmarks are needed to keep pace with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark for evaluating the ability of language models to answer short factual questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language across 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to obtain high-quality questions and answers, where the reference answers are static and do not change over time. Third, following SimpleQA, the questions and answers are very short, and grading is easy to automate via the OpenAI API. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA can guide developers to better understand the Chinese factuality abilities of their models and facilitate the growth of foundation models.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework</title>
      <itunes:episode>68</itunes:episode>
      <podcast:episode>68</podcast:episode>
      <itunes:title>M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1d2d9eb3-9555-4070-97c8-f358bc3c8e73</guid>
      <link>https://share.transistor.fm/s/17c291b0</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06176v1">http://arxiv.org/abs/2411.06176v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at <a href="https://multimodal-documents.github.io">https://multimodal-documents.github.io</a>.</p>
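
            <p><em>Illustrative sketch (not from the paper):</em> the retrieval setting above pairs each question with relevant pages from a very long document. A generic top-k page retrieval by embedding similarity, using random placeholder embeddings, can be sketched as:</p>

            <pre><code>import numpy as np

def top_k_pages(query_emb, page_embs, k=5):
    """Rank document pages by cosine similarity to a question embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    sims = p @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy example: 300 page embeddings of dimension 128 (random placeholders).
rng = np.random.default_rng(0)
idx, scores = top_k_pages(rng.normal(size=128), rng.normal(size=(300, 128)))
print(idx, scores.round(3))
</code></pre>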
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06176v1">http://arxiv.org/abs/2411.06176v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at <a href="https://multimodal-documents.github.io">https://multimodal-documents.github.io</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:35:21 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/17c291b0/85361f90.mp3" length="19950515" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1243</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Yew Ken Chia, Liying Cheng, Hou Pong Chan, Chaoqun Liu, Maojia Song, Sharifah Mahani Aljunied, Soujanya Poria, Lidong Bing</p>

            <p><strong>Title:</strong><br>
            M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.06176v1">http://arxiv.org/abs/2411.06176v1</a></p>

            <p><strong>Abstract:</strong><br>
            The ability to understand and answer questions over documents can be useful in many business and practical applications. However, documents often contain lengthy and diverse multimodal content such as text, figures, and tables, which is very time-consuming for humans to read thoroughly. Hence, there is an urgent need to develop effective and automated methods to aid humans in this task. In this work, we introduce M-LongDoc, a benchmark of 851 samples, and an automated framework to evaluate the performance of large multimodal models. We further propose a retrieval-aware tuning approach for efficient and effective multimodal document reading. Compared to existing works, our benchmark consists of more recent and lengthy documents with hundreds of pages, while also requiring open-ended solutions and not just extractive answers. To our knowledge, our training framework is the first to directly address the retrieval setting for multimodal long documents. To enable tuning open-source models, we construct a training corpus in a fully automatic manner for the question-answering task over such documents. Experiments show that our tuning approach achieves a relative improvement of 4.6% in the correctness of model responses, compared to the baseline open-source models. Our data, code, and models are available at <a href="https://multimodal-documents.github.io">https://multimodal-documents.github.io</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models</title>
      <itunes:episode>67</itunes:episode>
      <podcast:episode>67</podcast:episode>
      <itunes:title>Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">eefe6b7b-0ff1-4d03-8097-793408479cf2</guid>
      <link>https://share.transistor.fm/s/94c2fe47</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 21 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang</p>

            <p><strong>Title:</strong><br>
            Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07126v1">http://arxiv.org/abs/2411.07126v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.</p>
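
            <p><em>Illustrative sketch (not from the paper):</em> the Laplacian diffusion process operates on image signals split into frequency bands. A classic Laplacian-pyramid decomposition is one generic way to obtain such bands (shown here on random data; this is not Edify Image's actual process):</p>

            <pre><code>import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=3):
    """Split an image into band-pass residuals plus a low-frequency base."""
    bands, current = [], img.astype(np.float64)
    for _ in range(levels):
        down = gaussian_filter(current, sigma=1.0)[::2, ::2]
        up = zoom(down, 2, order=1)[: current.shape[0], : current.shape[1]]
        bands.append(current - up)   # high-frequency detail at this scale
        current = down
    bands.append(current)            # coarsest, low-frequency remainder
    return bands

for i, band in enumerate(laplacian_pyramid(np.random.rand(64, 64))):
    print(i, band.shape)
</code></pre>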
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 21 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang</p>

            <p><strong>Title:</strong><br>
            Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07126v1">http://arxiv.org/abs/2411.07126v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:35:00 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94c2fe47/f3433ea6.mp3" length="23857991" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1487</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 21 | cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang</p>

            <p><strong>Title:</strong><br>
            Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07126v1">http://arxiv.org/abs/2411.07126v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models</title>
      <itunes:episode>66</itunes:episode>
      <podcast:episode>66</podcast:episode>
      <itunes:title>GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">c06d7b85-5389-478e-bc20-22fb9247e8d8</guid>
      <link>https://share.transistor.fm/s/66a7492e</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.SE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05830v1">http://arxiv.org/abs/2411.05830v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce <strong>GitChameleon</strong>, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, <strong>GPT-4o</strong> achieves a pass@10 of only 39.9% (43.7% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon serves as a critical tool to advance the development of more adaptable and reliable code generation models. To facilitate further exploration of version-conditioned code generation, we make our code repository publicly accessible at <a href="https://github.com/NizarIslah/GitChameleon">https://github.com/NizarIslah/GitChameleon</a>.</p>
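
            <p><em>Illustrative sketch (not from the paper):</em> pass@k numbers like the pass@10 quoted above are commonly computed with the unbiased estimator below; whether GitChameleon uses this exact estimator is not stated in the abstract, and the numbers here are toy values:</p>

            <pre><code>from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated per problem, c of them pass the tests."""
    if k > n - c:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: 20 generations for one problem, 3 of which pass the unit tests.
print(round(pass_at_k(n=20, c=3, k=10), 3))
</code></pre>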
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.SE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05830v1">http://arxiv.org/abs/2411.05830v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce <strong>GitChameleon</strong>, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, <strong>GPT-4o</strong> achieves a pass@10 of only 39.9% (43.7% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon serves as a critical tool to advance the development of more adaptable and reliable code generation models. To facilitate further exploration of version-conditioned code generation, we make our code repository publicly accessible at <a href="https://github.com/NizarIslah/GitChameleon">https://github.com/NizarIslah/GitChameleon</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:34:38 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/66a7492e/d5c3b955.mp3" length="23608050" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1472</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 18 | cs.SE, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia</p>

            <p><strong>Title:</strong><br>
            GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05830v1">http://arxiv.org/abs/2411.05830v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks often overlook this dynamic aspect, and the one that does consider it relies on static code prediction tasks without execution-based evaluation, offering a limited perspective on a model's practical usability. To address this gap, we introduce <strong>GitChameleon</strong>, a novel, manually curated dataset comprising 116 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon is designed to rigorously assess the ability of modern large language models (LLMs) to generate version-specific code that is not only syntactically correct but also functionally accurate upon execution. Our comprehensive evaluations reveal that state-of-the-art LLMs struggle with this task; for instance, <strong>GPT-4o</strong> achieves a pass@10 of only 39.9% (43.7% when provided with error feedback), highlighting the complexity of the problem and the limitations of current models. By providing an execution-based benchmark that emphasizes the dynamic nature of code libraries, GitChameleon serves as a critical tool to advance the development of more adaptable and reliable code generation models. To facilitate further exploration of version-conditioned code generation, we make our code repository publicly accessible at <a href="https://github.com/NizarIslah/GitChameleon">https://github.com/NizarIslah/GitChameleon</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Watermark Anything with Localized Messages</title>
      <itunes:episode>65</itunes:episode>
      <podcast:episode>65</podcast:episode>
      <itunes:title>Watermark Anything with Localized Messages</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05440cbe-5fc5-4b71-8453-0eb9dfae9908</guid>
      <link>https://share.transistor.fm/s/95046dec</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze</p>

            <p><strong>Title:</strong><br>
            Watermark Anything with Localized Messages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07231v1">http://arxiv.org/abs/2411.07231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions - no larger than 10% of the image surface - even for small 256×256 images.</p>
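
            <p><em>Illustrative sketch (not from the paper):</em> "less than 1 bit error" refers to how many of the 32 embedded bits are decoded incorrectly per region. A trivial bit-error count on toy data:</p>

            <pre><code>import numpy as np

def bit_errors(embedded, decoded):
    """Number of positions where the decoded 32-bit message disagrees."""
    return int(np.sum(np.asarray(embedded) != np.asarray(decoded)))

rng = np.random.default_rng(0)
message = rng.integers(0, 2, size=32)
decoded = message.copy()
decoded[5] ^= 1                      # simulate one flipped bit
print(bit_errors(message, decoded))  # -> 1
</code></pre>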
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze</p>

            <p><strong>Title:</strong><br>
            Watermark Anything with Localized Messages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07231v1">http://arxiv.org/abs/2411.07231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions - no larger than 10% of the image surface - even for small 256×256 images.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:34:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/95046dec/bc400cd1.mp3" length="22543048" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1405</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 11 | cs.CV, cs.CR</p>

            <p><strong>Authors:</strong><br>
            Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, Matthijs Douze</p>

            <p><strong>Title:</strong><br>
            Watermark Anything with Localized Messages</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.07231v1">http://arxiv.org/abs/2411.07231v1</a></p>

            <p><strong>Abstract:</strong><br>
            Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions - no larger than 10% of the image surface - even for small 256×256 images.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Autoregressive Models in Vision: A Survey</title>
      <itunes:episode>64</itunes:episode>
      <podcast:episode>64</podcast:episode>
      <itunes:title>Autoregressive Models in Vision: A Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5d3be3e5-12d3-4623-8562-3ed18a5b54ad</guid>
      <link>https://share.transistor.fm/s/183b2577</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Autoregressive Models in Vision: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05902v1">http://arxiv.org/abs/2411.05902v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: <a href="https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey">https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey</a>.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Autoregressive Models in Vision: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05902v1">http://arxiv.org/abs/2411.05902v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: <a href="https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey">https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 12 Nov 2024 19:33:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/183b2577/cd1e2027.mp3" length="22008895" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1372</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong</p>

            <p><strong>Title:</strong><br>
            Autoregressive Models in Vision: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05902v1">http://arxiv.org/abs/2411.05902v1</a></p>

            <p><strong>Abstract:</strong><br>
            Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: <a href="https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey">https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey</a>.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation</title>
      <itunes:episode>63</itunes:episode>
      <podcast:episode>63</podcast:episode>
      <itunes:title>LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1a731374-3a9e-4137-aae0-efdb836e25c1</guid>
      <link>https://share.transistor.fm/s/990770a5</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04997v1">http://arxiv.org/abs/2411.04997v1</a></p>

            <p><strong>Abstract:</strong><br>
            CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models (LLMs) like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the context window and capability limitations of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.</p>
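
            <p><em>Illustrative sketch (not from the paper):</em> the abstract describes contrastive fine-tuning in caption space; the symmetric (CLIP-style) contrastive loss is one common form of such an objective. A minimal PyTorch version on random embeddings, not the LLM2CLIP implementation:</p>

            <pre><code>import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """InfoNCE over matched image/caption pairs, averaged over both directions."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # cosine similarity matrix
    targets = torch.arange(img_emb.size(0))        # i-th image pairs with i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of 8 random 512-d embeddings.
print(symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
</code></pre>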
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04997v1">http://arxiv.org/abs/2411.04997v1</a></p>

            <p><strong>Abstract:</strong><br>
            CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models (LLMs) like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the context window and capability limitations of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:41:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/990770a5/b54ee699.mp3" length="24518350" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1529</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CV, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu</p>

            <p><strong>Title:</strong><br>
            LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04997v1">http://arxiv.org/abs/2411.04997v1</a></p>

            <p><strong>Abstract:</strong><br>
            CLIP is one of the most important multimodal foundational models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. However, with the rapid advancements in large language models (LLMs) like GPT-4 and LLaMA, the boundaries of language comprehension and generation are continually being pushed. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. LLMs' strong textual understanding can fundamentally improve CLIP's ability to handle image captions, drastically enhancing its ability to process long and complex texts, a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on a vast corpus of text, possessing open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the output layer's textual discriminability. We then design an efficient training process where the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM's presence, we can now incorporate longer and more complex captions without being restricted by the context window and capability limitations of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Balancing Pipeline Parallelism with Vocabulary Parallelism</title>
      <itunes:episode>62</itunes:episode>
      <podcast:episode>62</podcast:episode>
      <itunes:title>Balancing Pipeline Parallelism with Vocabulary Parallelism</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e89e5539-d9ae-4264-86a0-4884c4ea94be</guid>
      <link>https://share.transistor.fm/s/096d97e7</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan</p>

            <p><strong>Title:</strong><br>
            Balancing Pipeline Parallelism with Vocabulary Parallelism</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05288v1">http://arxiv.org/abs/2411.05288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large-vocabulary scenarios. Our implementation is open-sourced at <a href="https://github.com/sail-sg/VocabularyParallelism">https://github.com/sail-sg/VocabularyParallelism</a>.</p>
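
            <p><em>Illustrative sketch (not from the paper):</em> the core idea of partitioning the vocabulary layer evenly across devices can be shown by sharding the output projection's vocabulary rows and concatenating the partial logits; the paper's pipeline scheduling and communication-reduction algorithms are omitted here:</p>

            <pre><code>import torch

vocab_size, hidden, num_stages = 50_000, 1_024, 4
output_head = torch.randn(vocab_size, hidden)

# Split the vocabulary rows evenly, one shard per pipeline stage.
shards = torch.chunk(output_head, num_stages, dim=0)

hidden_states = torch.randn(2, hidden)               # a tiny batch of token states
partial = [hidden_states @ s.t() for s in shards]    # each stage computes its logits
logits = torch.cat(partial, dim=-1)                  # gather into full-vocab logits

assert logits.shape == (2, vocab_size)
print(logits.shape)
</code></pre>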
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan</p>

            <p><strong>Title:</strong><br>
            Balancing Pipeline Parallelism with Vocabulary Parallelism</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05288v1">http://arxiv.org/abs/2411.05288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large-vocabulary scenarios. Our implementation is open-sourced at <a href="https://github.com/sail-sg/VocabularyParallelism">https://github.com/sail-sg/VocabularyParallelism</a>.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:41:14 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/096d97e7/6e2e5f5b.mp3" length="22682662" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1414</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.DC</p>

            <p><strong>Authors:</strong><br>
            Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan</p>

            <p><strong>Title:</strong><br>
            Balancing Pipeline Parallelism with Vocabulary Parallelism</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05288v1">http://arxiv.org/abs/2411.05288v1</a></p>

            <p><strong>Abstract:</strong><br>
            Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism .</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>StdGEN: Semantic-Decomposed 3D Character Generation from Single Images</title>
      <itunes:episode>61</itunes:episode>
      <podcast:episode>61</podcast:episode>
      <itunes:title>StdGEN: Semantic-Decomposed 3D Character Generation from Single Images</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f8401b60-0c48-4c91-bec7-b1bd2cdcd736</guid>
      <link>https://share.transistor.fm/s/78cf1985</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuze He, Yanning Zhou, Wang Zhao, Zhongkai Wu, Kaiwen Xiao, Wei Yang, Yong-Jin Liu, Xiao Han</p>

            <p><strong>Title:</strong><br>
            StdGEN: Semantic-Decomposed 3D Character Generation from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05738v1">http://arxiv.org/abs/2411.05738v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness and efficiency; i.e., it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair, in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color and semantics from multi-view images in a feed-forward manner. A differentiable multi-layer semantic surface extraction scheme is introduced to acquire meshes from hybrid implicit fields reconstructed by our S-LRM. Additionally, a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module are integrated into the pipeline to facilitate high-quality, decomposable 3D character generation. Extensive experiments demonstrate our state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture and decomposability. StdGEN offers ready-to-use semantic-decomposed 3D characters and enables flexible customization for a wide range of applications. Project page: https://stdgen.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuze He, Yanning Zhou, Wang Zhao, Zhongkai Wu, Kaiwen Xiao, Wei Yang, Yong-Jin Liu, Xiao Han</p>

            <p><strong>Title:</strong><br>
            StdGEN: Semantic-Decomposed 3D Character Generation from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05738v1">http://arxiv.org/abs/2411.05738v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness and efficiency; i.e., it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair, in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color and semantics from multi-view images in a feed-forward manner. A differentiable multi-layer semantic surface extraction scheme is introduced to acquire meshes from hybrid implicit fields reconstructed by our S-LRM. Additionally, a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module are integrated into the pipeline to facilitate high-quality, decomposable 3D character generation. Extensive experiments demonstrate our state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture and decomposability. StdGEN offers ready-to-use semantic-decomposed 3D characters and enables flexible customization for a wide range of applications. Project page: https://stdgen.github.io</p>
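
            <p><strong>Code Sketch:</strong><br>
            Purely as an illustrative toy (none of this is from the paper's code), the module below jointly predicts density, color, and a semantic class for query points conditioned on a stand-in multi-view feature, i.e. the kind of joint geometry/color/semantics head that a semantic-aware implicit field implies. Layer sizes and the three classes are assumptions.</p>

            <pre><code># Minimal toy of a semantics-aware implicit field head: one feed-forward pass
# maps (3D point, pretend multi-view feature) to density, RGB, and a semantic
# label distribution. All sizes are illustrative.
import torch
import torch.nn as nn

class SemanticFieldHead(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density = nn.Linear(hidden, 1)              # geometry
        self.color = nn.Linear(hidden, 3)                # RGB
        self.semantic = nn.Linear(hidden, num_classes)   # e.g. body / clothes / hair

    def forward(self, points, features):
        x = self.backbone(torch.cat([points, features], dim=-1))
        return (torch.relu(self.density(x)),             # non-negative density
                torch.sigmoid(self.color(x)),             # colors in [0, 1]
                self.semantic(x).softmax(dim=-1))         # per-class probabilities

head = SemanticFieldHead()
pts = torch.rand(1024, 3)        # query points in the unit cube
feats = torch.randn(1024, 64)    # stand-in for aggregated multi-view features
sigma, rgb, sem = head(pts, feats)
print(sigma.shape, rgb.shape, sem.shape)</code></pre>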
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:40:53 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/78cf1985/4766ae82.mp3" length="20972803" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1307</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yuze He, Yanning Zhou, Wang Zhao, Zhongkai Wu, Kaiwen Xiao, Wei Yang, Yong-Jin Liu, Xiao Han</p>

            <p><strong>Title:</strong><br>
            StdGEN: Semantic-Decomposed 3D Character Generation from Single Images</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05738v1">http://arxiv.org/abs/2411.05738v1</a></p>

            <p><strong>Abstract:</strong><br>
            We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. Unlike previous methods which struggle with limited decomposability, unsatisfactory quality, and long optimization times, StdGEN features decomposability, effectiveness and efficiency; i.e., it generates intricately detailed 3D characters with separated semantic components such as the body, clothes, and hair, in three minutes. At the core of StdGEN is our proposed Semantic-aware Large Reconstruction Model (S-LRM), a transformer-based generalizable model that jointly reconstructs geometry, color and semantics from multi-view images in a feed-forward manner. A differentiable multi-layer semantic surface extraction scheme is introduced to acquire meshes from hybrid implicit fields reconstructed by our S-LRM. Additionally, a specialized efficient multi-view diffusion model and an iterative multi-layer surface refinement module are integrated into the pipeline to facilitate high-quality, decomposable 3D character generation. Extensive experiments demonstrate our state-of-the-art performance in 3D anime character generation, surpassing existing baselines by a significant margin in geometry, texture and decomposability. StdGEN offers ready-to-use semantic-decomposed 3D characters and enables flexible customization for a wide range of applications. Project page: https://stdgen.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DELIFT: Data Efficient Language model Instruction Fine Tuning</title>
      <itunes:episode>60</itunes:episode>
      <podcast:episode>60</podcast:episode>
      <itunes:title>DELIFT: Data Efficient Language model Instruction Fine Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2c66bc75-6000-4e9f-8d13-6a678ddd70f5</guid>
      <link>https://share.transistor.fm/s/9283d062</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevksy</p>

            <p><strong>Title:</strong><br>
            DELIFT: Data Efficient Language model Instruction Fine Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04425v2">http://arxiv.org/abs/2411.04425v2</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual fine-tuning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model's responses to other samples, effectively measuring the informational value relative to the model's current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevksy</p>

            <p><strong>Title:</strong><br>
            DELIFT: Data Efficient Language model Instruction Fine Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04425v2">http://arxiv.org/abs/2411.04425v2</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual fine-tuning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model's responses to other samples, effectively measuring the informational value relative to the model's current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.</p>
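
            <p><strong>Code Sketch:</strong><br>
            A minimal sketch of the selection step described above, assuming a precomputed pairwise utility matrix (random numbers here, standing in for model-derived utilities): greedily grow a subset under a facility-location-style submodular objective. The budget, matrix, and variable names are illustrative only.</p>

            <pre><code># Toy greedy submodular selection over a pairwise utility matrix U[i, j]
# ("how much sample i helps the model answer sample j").
import numpy as np

rng = np.random.default_rng(0)
n, budget = 200, 20
U = rng.random((n, n))                     # hypothetical pairwise utilities

selected, best_cover = [], np.zeros(n)     # best utility each sample receives so far
for _ in range(budget):
    # Marginal gain of adding candidate i: improvement of the element-wise max coverage.
    gains = np.maximum(U, best_cover).sum(axis=1) - best_cover.sum()
    gains[selected] = -np.inf              # never pick the same sample twice
    i = int(np.argmax(gains))
    selected.append(i)
    best_cover = np.maximum(best_cover, U[i])

print("selected subset:", selected)</code></pre>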
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:40:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9283d062/5360197f.mp3" length="20497574" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1277</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, Marina Danilevksy</p>

            <p><strong>Title:</strong><br>
            DELIFT: Data Efficient Language model Instruction Fine Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04425v2">http://arxiv.org/abs/2411.04425v2</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual fine-tuning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model's responses to other samples, effectively measuring the informational value relative to the model's current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study</title>
      <itunes:episode>59</itunes:episode>
      <podcast:episode>59</podcast:episode>
      <itunes:title>Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">dcfe5521-aad9-440b-baf3-9e8f4a11fb47</guid>
      <link>https://share.transistor.fm/s/e52960c2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            André Storhaug, Jingyue Li</p>

            <p><strong>Title:</strong><br>
            Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02462v1">http://arxiv.org/abs/2411.02462v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            André Storhaug, Jingyue Li</p>

            <p><strong>Title:</strong><br>
            Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02462v1">http://arxiv.org/abs/2411.02462v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.</p>
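
            <p><strong>Code Sketch:</strong><br>
            For intuition about why PEFT is cheap, here is a minimal from-scratch LoRA layer (one of the methods compared in the paper, written independently rather than taken from it): the pretrained weight is frozen and only a low-rank update B·A is trained. The rank, scaling, and layer sizes are arbitrary defaults.</p>

            <pre><code># Minimal LoRA wrapper around a frozen nn.Linear.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")     # only A and B are updated</code></pre>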
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:40:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e52960c2/e462d895.mp3" length="24149742" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1506</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.SE, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            André Storhaug, Jingyue Li</p>

            <p><strong>Title:</strong><br>
            Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02462v1">http://arxiv.org/abs/2411.02462v1</a></p>

            <p><strong>Abstract:</strong><br>
            The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</title>
      <itunes:episode>58</itunes:episode>
      <podcast:episode>58</podcast:episode>
      <itunes:title>RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4da0c7db-960e-4b7e-bb9c-aba8d2b95a27</guid>
      <link>https://share.transistor.fm/s/88da00de</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz</p>

            <p><strong>Title:</strong><br>
            RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04097v1">http://arxiv.org/abs/2411.04097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz</p>

            <p><strong>Title:</strong><br>
            RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04097v1">http://arxiv.org/abs/2411.04097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.</p>
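
            <p><strong>Code Sketch:</strong><br>
            A rough sketch of the discovery step only (not the paper's algorithm verbatim): cluster region-level features and rank clusters by how strongly their presence co-occurs with zero-shot errors. The features, error flags, and cluster count below are synthetic stand-ins.</p>

            <pre><code># Toy region-level clustering to surface candidate spurious features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_regions, dim, n_clusters = 5000, 32, 10
region_feats = rng.normal(size=(n_regions, dim))     # per-region embeddings
image_error = rng.integers(0, 2, size=n_regions)     # 1 if the owning image was misclassified

clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(region_feats)

# Score each cluster by the error rate of images containing its regions, relative
# to the overall error rate; unusually high scores flag candidate spurious features.
overall = image_error.mean()
for c in range(n_clusters):
    rate = image_error[clusters == c].mean()
    print(f"cluster {c}: error rate {rate:.3f} (overall {overall:.3f})")</code></pre>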
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:39:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/88da00de/c46b3533.mp3" length="21536233" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1342</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz</p>

            <p><strong>Title:</strong><br>
            RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04097v1">http://arxiv.org/abs/2411.04097v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities</title>
      <itunes:episode>57</itunes:episode>
      <podcast:episode>57</podcast:episode>
      <itunes:title>The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">579e9c4d-407a-49e2-a48e-2457f5fbad7e</guid>
      <link>https://share.transistor.fm/s/20a8a9a6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaofeng Wu, Xinyan Velocity Yu, Dani Yogatama, Jiasen Lu, Yoon Kim</p>

            <p><strong>Title:</strong><br>
            The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04986v1">http://arxiv.org/abs/2411.04986v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages and modalities), which places semantically similar inputs near one another, even if they are from different modalities/languages. We term this the semantic hub hypothesis, following the hub-and-spoke model from neuroscience (Patterson et al., 2007) which posits that semantic knowledge in the human brain is organized through a transmodal semantic "hub" which integrates information from various modality-specific "spokes" regions. We first show that model representations for semantically equivalent inputs in different languages are similar in the intermediate layers, and that this space can be interpreted using the model's dominant pretraining language via the logit lens. This tendency extends to other data types, including arithmetic expressions, code, and visual/audio inputs. Interventions in the shared representation space in one data type also predictably affect model outputs in other data types, suggesting that this shared representation space is not simply a vestigial byproduct of large-scale training on broad data, but something that is actively utilized by the model during input processing.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaofeng Wu, Xinyan Velocity Yu, Dani Yogatama, Jiasen Lu, Yoon Kim</p>

            <p><strong>Title:</strong><br>
            The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04986v1">http://arxiv.org/abs/2411.04986v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages and modalities), which places semantically similar inputs near one another, even if they are from different modalities/languages. We term this the semantic hub hypothesis, following the hub-and-spoke model from neuroscience (Patterson et al., 2007) which posits that semantic knowledge in the human brain is organized through a transmodal semantic "hub" which integrates information from various modality-specific "spokes" regions. We first show that model representations for semantically equivalent inputs in different languages are similar in the intermediate layers, and that this space can be interpreted using the model's dominant pretraining language via the logit lens. This tendency extends to other data types, including arithmetic expressions, code, and visual/audio inputs. Interventions in the shared representation space in one data type also predictably affect model outputs in other data types, suggesting that this shared representation space is not simply a vestigial byproduct of large-scale training on broad data, but something that is actively utilized by the model during input processing.</p>
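
            <p><strong>Code Sketch:</strong><br>
            The logit-lens probe mentioned above can be illustrated in a few lines. With a real model you would take an intermediate layer's residual state and the model's own final layer norm and unembedding matrix; the weights and the tiny vocabulary here are random placeholders.</p>

            <pre><code># Toy logit lens: project an intermediate hidden state through (layer norm +
# unembedding) to see which vocabulary items it already "points at".
import torch

torch.manual_seed(0)
d_model, vocab = 64, 10
toy_vocab = ["dog", "chien", "perro", "cat", "chat", "gato", "2", "two", "deux", "dos"]

hidden = torch.randn(d_model)        # stand-in for an intermediate-layer residual state
ln = torch.nn.LayerNorm(d_model)     # stand-in final layer norm
W_U = torch.randn(vocab, d_model)    # stand-in unembedding matrix

logits = ln(hidden) @ W_U.t()
top = logits.topk(3)
print([(toy_vocab[i], round(float(v), 2)) for v, i in zip(top.values, top.indices)])</code></pre>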
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:39:19 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/20a8a9a6/f70e03c8.mp3" length="23112373" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1441</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhaofeng Wu, Xinyan Velocity Yu, Dani Yogatama, Jiasen Lu, Yoon Kim</p>

            <p><strong>Title:</strong><br>
            The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04986v1">http://arxiv.org/abs/2411.04986v1</a></p>

            <p><strong>Abstract:</strong><br>
            Modern language models can process inputs across diverse languages and modalities. We hypothesize that models acquire this capability through learning a shared representation space across heterogeneous data types (e.g., different languages and modalities), which places semantically similar inputs near one another, even if they are from different modalities/languages. We term this the semantic hub hypothesis, following the hub-and-spoke model from neuroscience (Patterson et al., 2007) which posits that semantic knowledge in the human brain is organized through a transmodal semantic "hub" which integrates information from various modality-specific "spokes" regions. We first show that model representations for semantically equivalent inputs in different languages are similar in the intermediate layers, and that this space can be interpreted using the model's dominant pretraining language via the logit lens. This tendency extends to other data types, including arithmetic expressions, code, and visual/audio inputs. Interventions in the shared representation space in one data type also predictably affect model outputs in other data types, suggesting that this shared representation space is not simply a vestigial byproduct of large-scale training on broad data, but something that is actively utilized by the model during input processing.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Improving the detection of technical debt in Java source code with an enriched dataset</title>
      <itunes:episode>56</itunes:episode>
      <podcast:episode>56</podcast:episode>
      <itunes:title>Improving the detection of technical debt in Java source code with an enriched dataset</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2785d646-8b00-42c9-a411-ff7093de3e0d</guid>
      <link>https://share.transistor.fm/s/39302c45</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman</p>

            <p><strong>Title:</strong><br>
            Improving the detection of technical debt in Java source code with an enriched dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05457v1">http://arxiv.org/abs/2411.05457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debt that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debts, most of the existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debts contained in the source code. To fill such a gap, in this study, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our work is two-fold: (i) We believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) The proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman</p>

            <p><strong>Title:</strong><br>
            Improving the detection of technical debt in Java source code with an enriched dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05457v1">http://arxiv.org/abs/2411.05457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debt that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debts, most of the existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debts contained in the source code. To fill such a gap, in this study, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our work is two-fold: (i) We believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) The proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.</p>
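
            <p><strong>Code Sketch:</strong><br>
            As a minimal, invented illustration of the detection task (not the paper's models or data), a bag-of-words baseline can classify comment-plus-code pairs as technical debt or not; the four examples, labels, and the "||" separator are made up.</p>

            <pre><code># Toy SATD-style classifier over concatenated comment + code text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

samples = [
    "// TODO: quick hack, clean this up later  ||  int t = x * 7; // magic number",
    "// FIXME: workaround for null ids  ||  if (id == null) return -1;",
    "// Returns the user's display name  ||  return user.getName();",
    "// Computes the invoice total  ||  return items.stream().mapToInt(Item::price).sum();",
]
labels = [1, 1, 0, 0]   # 1 = technical debt, 0 = not

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(samples, labels)
print(clf.predict(["// HACK: copy-pasted from OrderService  ||  total += p;"]))</code></pre>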
            ]]>
      </content:encoded>
      <pubDate>Mon, 11 Nov 2024 19:38:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/39302c45/dd687268.mp3" length="25292009" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1577</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.SE</p>

            <p><strong>Authors:</strong><br>
            Nam Le Hai, Anh M. T. Bui, Phuong T. Nguyen, Davide Di Ruscio, Rick Kazman</p>

            <p><strong>Title:</strong><br>
            Improving the detection of technical debt in Java source code with an enriched dataset</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05457v1">http://arxiv.org/abs/2411.05457v1</a></p>

            <p><strong>Abstract:</strong><br>
            Technical debt (TD) is a term used to describe the additional work and costs that emerge when developers have opted for a quick and easy solution to a problem, rather than a more effective and well-designed, but time-consuming approach. Self-Admitted Technical Debts (SATDs) are a specific type of technical debt that developers intentionally document and acknowledge, typically via textual comments. While these self-admitted comments are a useful tool for identifying technical debts, most of the existing approaches focus on capturing crucial tokens associated with various categories of TD, neglecting the rich information embedded within the source code itself. Recent research has focused on detecting SATDs by analyzing comments embedded in source code, and there has been little work dealing with technical debts contained in the source code. To fill such a gap, in this study, through the analysis of comments and their associated source code from 974 Java projects hosted in the Stack corpus, we curated the first ever dataset of TD identified by code comments, coupled with its associated source code. Through an empirical evaluation, we found that the comments of the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy in predicting various types of technical debt. In this respect, our work is two-fold: (i) We believe that our dataset will catalyze future work in the domain, inspiring various research issues related to the recognition of technical debt; (ii) The proposed classifiers may serve as baselines for other studies on the detection of TD by means of the curated dataset.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models</title>
      <itunes:episode>55</itunes:episode>
      <podcast:episode>55</podcast:episode>
      <itunes:title>OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">afa0b9f8-a4a7-4609-8c65-a6ae050df50a</guid>
      <link>https://share.transistor.fm/s/27e3474f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 69 | cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu</p>

            <p><strong>Title:</strong><br>
            OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04905v1">http://arxiv.org/abs/2411.04905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code-optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code, and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 69 | cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu</p>

            <p><strong>Title:</strong><br>
            OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04905v1">http://arxiv.org/abs/2411.04905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code-optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code, and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.</p>
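
            <p><strong>Code Sketch:</strong><br>
            To make ingredient (1) concrete, here is a toy version of heuristic filtering plus exact, whitespace-insensitive deduplication of code files. The thresholds and rules are invented stand-ins for the released pipeline's far richer ones.</p>

            <pre><code># Toy data-cleaning pass: cheap heuristic filters and hash-based exact dedup.
import hashlib

def keep(code: str) -> bool:
    lines = code.splitlines() or [""]
    max_line = max(len(l) for l in lines)
    alnum_frac = sum(c.isalnum() for c in code) / max(len(code), 1)
    return max_line < 1000 and alnum_frac > 0.25 and len(lines) >= 3

def dedup(files):
    seen, unique = set(), []
    for code in files:
        key = hashlib.sha256(" ".join(code.split()).encode()).hexdigest()  # whitespace-insensitive
        if key not in seen:
            seen.add(key)
            unique.append(code)
    return unique

corpus = ["def f(x):\n    return x + 1\n# util", "def f(x):\n  return x + 1\n# util", "x" * 5000]
cleaned = [c for c in dedup(corpus) if keep(c)]
print(len(corpus), "->", len(cleaned))</code></pre>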
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:38:15 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/27e3474f/749413db.mp3" length="21916971" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1366</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 69 | cs.CL, cs.PL</p>

            <p><strong>Authors:</strong><br>
            Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu</p>

            <p><strong>Title:</strong><br>
            OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04905v1">http://arxiv.org/abs/2411.04905v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs suitable for rigorous scientific investigation, particularly those with reproducible data processing pipelines and transparent training protocols, remain limited. The scarcity is due to various challenges, including resource constraints, ethical considerations, and the competitive advantages of keeping models advanced. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Through this comprehensive release, we identify the key ingredients for building a top-tier code LLM: (1) code-optimized heuristic rules for data cleaning and methods for data deduplication, (2) recall of text corpus related to code, and (3) high-quality synthetic data in both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research, and enable reproducible advancements in code AI.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning</title>
      <itunes:episode>54</itunes:episode>
      <podcast:episode>54</podcast:episode>
      <itunes:title>ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">2de40a8f-88c3-4a23-b2fb-c9c9050fbbcb</guid>
      <link>https://share.transistor.fm/s/6eda0de1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 50 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz</p>

            <p><strong>Title:</strong><br>
            ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05003v1">http://arxiv.org/abs/2411.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 50 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz</p>

            <p><strong>Title:</strong><br>
            ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05003v1">http://arxiv.org/abs/2411.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.</p>
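
            <p><strong>Code Sketch:</strong><br>
            A tiny sketch of the "masked" part of step (2), under the assumption that the re-rendered anchor video carries a validity mask for observed pixels: the reconstruction loss is simply restricted to valid regions. The tensors and shapes are placeholders, and no actual video model is involved.</p>

            <pre><code># Toy masked reconstruction loss over a video tensor.
import torch

frames, channels, height, width = 4, 3, 32, 32
anchor = torch.rand(frames, channels, height, width)             # noisy re-rendered anchor video
valid = (torch.rand(frames, 1, height, width) > 0.3).float()     # 1 = observed pixel, 0 = hole
pred = torch.rand(frames, channels, height, width, requires_grad=True)  # model output stand-in

masked = (pred - anchor) ** 2 * valid                            # error only where pixels are observed
masked_mse = masked.sum() / valid.expand_as(pred).sum().clamp(min=1)
masked_mse.backward()
print(float(masked_mse), pred.grad.abs().mean().item())</code></pre>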
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:37:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6eda0de1/3e008953.mp3" length="19151366" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1193</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 50 | cs.CV, cs.AI, cs.GR, cs.LG</p>

            <p><strong>Authors:</strong><br>
            David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz</p>

            <p><strong>Title:</strong><br>
            ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05003v1">http://arxiv.org/abs/2411.05003v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, breakthroughs in video modeling have allowed for controllable camera trajectories in generated videos. However, these methods cannot be directly applied to user-provided videos that are not generated by a video model. In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. Our method allows us to re-generate the reference video, with all its existing scene motion, from vastly different angles and with cinematic camera motion. Notably, using our method we can also plausibly hallucinate parts of the scene that were not observable in the reference video. Our method works by (1) generating a noisy anchor video with a new camera trajectory using multiview diffusion models or depth-based point cloud rendering and then (2) regenerating the anchor video into a clean and temporally consistent reangled video using our proposed masked video fine-tuning technique.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BitNet a4.8: 4-bit Activations for 1-bit LLMs</title>
      <itunes:episode>53</itunes:episode>
      <podcast:episode>53</podcast:episode>
      <itunes:title>BitNet a4.8: 4-bit Activations for 1-bit LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8792caf0-3a5a-409e-8275-12df394936a9</guid>
      <link>https://share.transistor.fm/s/6ecbaf6b</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 41 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet a4.8: 4-bit Activations for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04965v1">http://arxiv.org/abs/2411.04965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 41 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet a4.8: 4-bit Activations for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04965v1">http://arxiv.org/abs/2411.04965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:37:33 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6ecbaf6b/0d4fd151.mp3" length="24434316" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1523</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 41 | cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Hongyu Wang, Shuming Ma, Furu Wei</p>

            <p><strong>Title:</strong><br>
            BitNet a4.8: 4-bit Activations for 1-bit LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04965v1">http://arxiv.org/abs/2411.04965v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion</title>
      <itunes:episode>52</itunes:episode>
      <podcast:episode>52</podcast:episode>
      <itunes:title>DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d62561be-5481-4ff3-b119-38355eb9d16c</guid>
      <link>https://share.transistor.fm/s/db861a93</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 27 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04928v1">http://arxiv.org/abs/2411.04928v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce <strong>DimensionX</strong>, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 27 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04928v1">http://arxiv.org/abs/2411.04928v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce <strong>DimensionX</strong>, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:37:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/db861a93/5849c6f5.mp3" length="22147291" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1381</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 27 | cs.CV, cs.AI, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, Yikai Wang</p>

            <p><strong>Title:</strong><br>
            DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04928v1">http://arxiv.org/abs/2411.04928v1</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce <strong>DimensionX</strong>, a framework designed to generate photorealistic 3D and 4D scenes from just a single image with video diffusion. Our approach begins with the insight that both the spatial structure of a 3D scene and the temporal evolution of a 4D scene can be effectively represented through sequences of video frames. While recent video diffusion models have shown remarkable success in producing vivid visuals, they face limitations in directly recovering 3D/4D scenes due to limited spatial and temporal controllability during generation. To overcome this, we propose ST-Director, which decouples spatial and temporal factors in video diffusion by learning dimension-aware LoRAs from dimension-variant data. This controllable video diffusion approach enables precise manipulation of spatial structure and temporal dynamics, allowing us to reconstruct both 3D and 4D representations from sequential frames with the combination of spatial and temporal dimensions. Additionally, to bridge the gap between generated videos and real-world scenes, we introduce a trajectory-aware mechanism for 3D generation and an identity-preserving denoising strategy for 4D generation. Extensive experiments on various real-world and synthetic datasets demonstrate that DimensionX achieves superior results in controllable video generation, as well as in 3D and 4D scene generation, compared with previous methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models</title>
      <itunes:episode>51</itunes:episode>
      <podcast:episode>51</podcast:episode>
      <itunes:title>Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3702b238-d010-4158-bbe7-a5fdf8201e46</guid>
      <link>https://share.transistor.fm/s/126991cc</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04996v1">http://arxiv.org/abs/2411.04996v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04996v1">http://arxiv.org/abs/2411.04996v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:36:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/126991cc/c83dad57.mp3" length="23925290" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1492</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin</p>

            <p><strong>Title:</strong><br>
            Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04996v1">http://arxiv.org/abs/2411.04996v1</a></p>

            <p><strong>Abstract:</strong><br>
            The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation</title>
      <itunes:episode>50</itunes:episode>
      <podcast:episode>50</podcast:episode>
      <itunes:title>TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">05e4bab2-8fe9-46dc-a28d-769a55baef31</guid>
      <link>https://share.transistor.fm/s/02d2e69a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenhao Wang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04709v1">http://arxiv.org/abs/2411.04709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at https://tip-i2v.github.io.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenhao Wang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04709v1">http://arxiv.org/abs/2411.04709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at https://tip-i2v.github.io.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:36:30 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/02d2e69a/5a9a9744.mp3" length="23726755" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1479</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wenhao Wang, Yi Yang</p>

            <p><strong>Title:</strong><br>
            TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04709v1">http://arxiv.org/abs/2411.04709v1</a></p>

            <p><strong>Abstract:</strong><br>
            Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. Additionally, we provide the corresponding generated videos from five state-of-the-art image-to-video models. We begin by outlining the time-consuming and costly process of curating this large-scale dataset. Next, we compare TIP-I2V to two popular prompt datasets, VidProM (text-to-video) and DiffusionDB (text-to-image), highlighting differences in both basic and semantic information. This dataset enables advancements in image-to-video research. For instance, to develop better models, researchers can use the prompts in TIP-I2V to analyze user preferences and evaluate the multi-dimensional performance of their trained models; and to enhance model safety, they may focus on addressing the misinformation issue caused by image-to-video models. The new research inspired by TIP-I2V and the differences with existing datasets emphasize the importance of a specialized image-to-video prompt dataset. The project is publicly available at https://tip-i2v.github.io.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model</title>
      <itunes:episode>49</itunes:episode>
      <podcast:episode>49</podcast:episode>
      <itunes:title>Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d2cd6463-6221-4a0e-a8d8-a82c4138a58b</guid>
      <link>https://share.transistor.fm/s/8db29d60</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Ho-Jin Choi</p>

            <p><strong>Title:</strong><br>
            Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04496v1">http://arxiv.org/abs/2411.04496v1</a></p>

            <p><strong>Abstract:</strong><br>
            To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For large language model (LLM)-based conversational agents, planning appropriate conversational skills, as humans do, is challenging due to the complexity of social dialogue, especially in interactive scenarios. To address this, we propose a skill-of-mind-annotated conversation dataset, named Multifaceted Skill-of-Mind, which includes multi-turn and multifaceted conversational skills across various interactive scenarios (e.g., long-term, counseling, task-oriented), grounded in diverse social contexts (e.g., demographics, persona, rules of thumb). This dataset consists of roughly 100K conversations. Using this dataset, we introduce a new family of skill-of-mind-infused LLMs, named Thanos, with model sizes of 1B, 3B, and 8B parameters. With extensive experiments, these models successfully demonstrate the skill-of-mind process and exhibit strong generalizability in inferring multifaceted skills across a variety of domains. Moreover, we show that Thanos significantly enhances the quality of responses generated by LLM-based conversational agents and promotes prosocial behavior in human evaluations.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Ho-Jin Choi</p>

            <p><strong>Title:</strong><br>
            Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04496v1">http://arxiv.org/abs/2411.04496v1</a></p>

            <p><strong>Abstract:</strong><br>
            To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For large language model (LLM)-based conversational agents, planning appropriate conversational skills, as humans do, is challenging due to the complexity of social dialogue, especially in interactive scenarios. To address this, we propose a skill-of-mind-annotated conversation dataset, named Multifaceted Skill-of-Mind, which includes multi-turn and multifaceted conversational skills across various interactive scenarios (e.g., long-term, counseling, task-oriented), grounded in diverse social contexts (e.g., demographics, persona, rules of thumb). This dataset consists of roughly 100K conversations. Using this dataset, we introduce a new family of skill-of-mind-infused LLMs, named Thanos, with model sizes of 1B, 3B, and 8B parameters. With extensive experiments, these models successfully demonstrate the skill-of-mind process and exhibit strong generalizability in inferring multifaceted skills across a variety of domains. Moreover, we show that Thanos significantly enhances the quality of responses generated by LLM-based conversational agents and promotes prosocial behavior in human evaluations.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:36:09 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8db29d60/70a690a8.mp3" length="21661617" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1350</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 15 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Young-Jun Lee, Dokyong Lee, Junyoung Youn, Kyeongjin Oh, Ho-Jin Choi</p>

            <p><strong>Title:</strong><br>
            Thanos: Enhancing Conversational Agents with Skill-of-Mind-Infused Large Language Model</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04496v1">http://arxiv.org/abs/2411.04496v1</a></p>

            <p><strong>Abstract:</strong><br>
            To increase social bonding with interlocutors, humans naturally acquire the ability to respond appropriately in a given situation by considering which conversational skill is most suitable for the response - a process we call skill-of-mind. For large language model (LLM)-based conversational agents, planning appropriate conversational skills, as humans do, is challenging due to the complexity of social dialogue, especially in interactive scenarios. To address this, we propose a skill-of-mind-annotated conversation dataset, named Multifaceted Skill-of-Mind, which includes multi-turn and multifaceted conversational skills across various interactive scenarios (e.g., long-term, counseling, task-oriented), grounded in diverse social contexts (e.g., demographics, persona, rules of thumb). This dataset consists of roughly 100K conversations. Using this dataset, we introduce a new family of skill-of-mind-infused LLMs, named Thanos, with model sizes of 1B, 3B, and 8B parameters. With extensive experiments, these models successfully demonstrate the skill-of-mind process and exhibit strong generalizability in inferring multifaceted skills across a variety of domains. Moreover, we show that Thanos significantly enhances the quality of responses generated by LLM-based conversational agents and promotes prosocial behavior in human evaluations.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?</title>
      <itunes:episode>48</itunes:episode>
      <podcast:episode>48</podcast:episode>
      <itunes:title>Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">be8cf53a-76f3-47b2-93dc-3453d12ed8e5</guid>
      <link>https://share.transistor.fm/s/e668d2c9</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05000v1">http://arxiv.org/abs/2411.05000v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05000v1">http://arxiv.org/abs/2411.05000v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:35:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/e668d2c9/a53c1bfb.mp3" length="21191405" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1321</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Jonathan Roberts, Kai Han, Samuel Albanie</p>

            <p><strong>Title:</strong><br>
            Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.05000v1">http://arxiv.org/abs/2411.05000v1</a></p>

            <p><strong>Abstract:</strong><br>
            As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation</title>
      <itunes:episode>47</itunes:episode>
      <podcast:episode>47</podcast:episode>
      <itunes:title>DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ed5b9b4a-a948-4189-b8be-a51d285af183</guid>
      <link>https://share.transistor.fm/s/94fede27</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04999v1">http://arxiv.org/abs/2411.04999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04999v1">http://arxiv.org/abs/2411.04999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:35:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94fede27/a2e1a08a.mp3" length="20434065" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1273</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.RO, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto</p>

            <p><strong>Title:</strong><br>
            DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04999v1">http://arxiv.org/abs/2411.04999v1</a></p>

            <p><strong>Abstract:</strong><br>
            Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos</title>
      <itunes:episode>46</itunes:episode>
      <podcast:episode>46</podcast:episode>
      <itunes:title>VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b04e1447-4fd3-4657-8308-ce72fb0afc27</guid>
      <link>https://share.transistor.fm/s/896f99ba</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04923v1">http://arxiv.org/abs/2411.04923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04923v1">http://arxiv.org/abs/2411.04923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.</p>
            ]]>
      </content:encoded>
      <pubDate>Fri, 08 Nov 2024 19:35:06 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/896f99ba/34e40510.mp3" length="26693001" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1665</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 12 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric Xing, Fahad Shahbaz Khan, Salman Khan</p>

            <p><strong>Title:</strong><br>
            VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04923v1">http://arxiv.org/abs/2411.04923v1</a></p>

            <p><strong>Abstract:</strong><br>
            Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination</title>
      <itunes:episode>45</itunes:episode>
      <podcast:episode>45</podcast:episode>
      <itunes:title>Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0888337d-cf6d-430d-8bce-4649b82288d1</guid>
      <link>https://share.transistor.fm/s/8a1f0694</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03823v1">http://arxiv.org/abs/2411.03823v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we also explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03823v1">http://arxiv.org/abs/2411.03823v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we also explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Nov 2024 19:17:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/8a1f0694/5386c6da.mp3" length="22991981" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1433</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 33 | cs.CV, cs.AI, cs.CL, cs.MM</p>

            <p><strong>Authors:</strong><br>
            Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang</p>

            <p><strong>Title:</strong><br>
            Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03823v1">http://arxiv.org/abs/2411.03823v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), they are less effective for MLLMs due to their various modalities and multiple training phases. In this study, we introduce a multimodal data contamination detection framework, MM-Detect, designed for MLLMs. Our experimental results indicate that MM-Detect is sensitive to varying degrees of contamination and can highlight significant performance improvements due to leakage of the training set of multimodal benchmarks. Furthermore, we also explore the possibility of contamination originating from the pre-training phase of LLMs used by MLLMs and the fine-tuning phase of MLLMs, offering new insights into the stages at which contamination may be introduced.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level</title>
      <itunes:episode>44</itunes:episode>
      <podcast:episode>44</podcast:episode>
      <itunes:title>Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">240533ee-38ee-4b81-93fb-f37af006004d</guid>
      <link>https://share.transistor.fm/s/bbad91e7</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03562v1">http://arxiv.org/abs/2411.03562v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03562v1">http://arxiv.org/abs/2411.03562v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Nov 2024 19:17:28 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/bbad91e7/9fe38b02.mp3" length="19458974" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1213</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 26 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang</p>

            <p><strong>Title:</strong><br>
            Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03562v1">http://arxiv.org/abs/2411.03562v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's capabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models</title>
      <itunes:episode>43</itunes:episode>
      <podcast:episode>43</podcast:episode>
      <itunes:title>Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4e95ede4-77ab-4609-a600-1913ab007edb</guid>
      <link>https://share.transistor.fm/s/28de42e2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma</p>

            <p><strong>Title:</strong><br>
            Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03884v1">http://arxiv.org/abs/2411.03884v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the <strong>optimal approximation rate</strong>, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma</p>

            <p><strong>Title:</strong><br>
            Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03884v1">http://arxiv.org/abs/2411.03884v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the <strong>optimal approximation rate</strong>, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Nov 2024 19:17:07 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/28de42e2/8129e6a3.mp3" length="22442361" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1399</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma</p>

            <p><strong>Title:</strong><br>
            Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03884v1">http://arxiv.org/abs/2411.03884v1</a></p>

            <p><strong>Abstract:</strong><br>
            Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the <strong>optimal approximation rate</strong>, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Self-Consistency Preference Optimization</title>
      <itunes:episode>42</itunes:episode>
      <podcast:episode>42</podcast:episode>
      <itunes:title>Self-Consistency Preference Optimization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">43508940-4b82-444b-bd98-77c0e3ece55e</guid>
      <link>https://share.transistor.fm/s/764f84d8</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu</p>

            <p><strong>Title:</strong><br>
            Self-Consistency Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04109v1">http://arxiv.org/abs/2411.04109v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu</p>

            <p><strong>Title:</strong><br>
            Self-Consistency Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04109v1">http://arxiv.org/abs/2411.04109v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Nov 2024 19:16:46 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/764f84d8/edf4d45b.mp3" length="19909067" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1241</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, Jane Yu</p>

            <p><strong>Title:</strong><br>
            Self-Consistency Preference Optimization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.04109v1">http://arxiv.org/abs/2411.04109v1</a></p>

            <p><strong>Abstract:</strong><br>
            Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond</title>
      <itunes:episode>41</itunes:episode>
      <podcast:episode>41</podcast:episode>
      <itunes:title>From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ff3f4a67-b09b-40b8-b2b7-518ce5985640</guid>
      <link>https://share.transistor.fm/s/b96677db</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz</p>

            <p><strong>Title:</strong><br>
            From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03590v1">http://arxiv.org/abs/2411.03590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz</p>

            <p><strong>Title:</strong><br>
            From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03590v1">http://arxiv.org/abs/2411.03590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Thu, 07 Nov 2024 19:16:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b96677db/70df6d5b.mp3" length="16592201" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1033</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Harsha Nori, Naoto Usuyama, Nicholas King, Scott Mayer McKinney, Xavier Fernandes, Sheng Zhang, Eric Horvitz</p>

            <p><strong>Title:</strong><br>
            From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03590v1">http://arxiv.org/abs/2411.03590v1</a></p>

            <p><strong>Abstract:</strong><br>
            Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems</title>
      <itunes:episode>40</itunes:episode>
      <podcast:episode>40</podcast:episode>
      <itunes:title>HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1415482f-555a-4841-b2eb-cd9808074513</guid>
      <link>https://share.transistor.fm/s/7f0123e6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 34 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02959v1">http://arxiv.org/abs/2411.02959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval systems. Typically, such RAG systems retrieve search results, download HTML sources of the results, and then extract plain texts from the HTML sources. Plain text documents or chunks are fed into the LLMs to augment the generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text in modeling knowledge in external documents, and most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information. Specifically, we design a two-step block-tree-based pruning method that prunes useless HTML blocks and keeps only the relevant part of the HTML. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 34 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02959v1">http://arxiv.org/abs/2411.02959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval systems. Typically, such RAG systems retrieve search results, download HTML sources of the results, and then extract plain texts from the HTML sources. Plain text documents or chunks are fed into the LLMs to augment the generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text in modeling knowledge in external documents, and most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information. Specifically, we design a two-step block-tree-based pruning method that prunes useless HTML blocks and keeps only the relevant part of the HTML. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:48:59 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7f0123e6/a8be2bd9.mp3" length="20282769" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1264</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 34 | cs.IR</p>

            <p><strong>Authors:</strong><br>
            Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen</p>

            <p><strong>Title:</strong><br>
            HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02959v1">http://arxiv.org/abs/2411.02959v1</a></p>

            <p><strong>Abstract:</strong><br>
            Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval systems. Typically, such RAG systems retrieve search results, download HTML sources of the results, and then extract plain texts from the HTML sources. Plain text documents or chunks are fed into the LLMs to augment the generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text in modeling knowledge in external documents, and most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information. Specifically, we design a two-step block-tree-based pruning method that prunes useless HTML blocks and keeps only the relevant part of the HTML. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>LLaMo: Large Language Model-based Molecular Graph Assistant</title>
      <itunes:episode>39</itunes:episode>
      <podcast:episode>39</podcast:episode>
      <itunes:title>LLaMo: Large Language Model-based Molecular Graph Assistant</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e4d4ccc1-f78b-4afd-aaa2-4698d190acc4</guid>
      <link>https://share.transistor.fm/s/fd1cbb37</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.LG, cs.AI, q-bio.MN</p>

            <p><strong>Authors:</strong><br>
            Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            LLaMo: Large Language Model-based Molecular Graph Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00871v1">http://arxiv.org/abs/2411.00871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.LG, cs.AI, q-bio.MN</p>

            <p><strong>Authors:</strong><br>
            Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            LLaMo: Large Language Model-based Molecular Graph Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00871v1">http://arxiv.org/abs/2411.00871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:48:35 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fd1cbb37/a9fd9867.mp3" length="23941974" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1493</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.LG, cs.AI, q-bio.MN</p>

            <p><strong>Authors:</strong><br>
            Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            LLaMo: Large Language Model-based Molecular Graph Assistant</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00871v1">http://arxiv.org/abs/2411.00871v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution</title>
      <itunes:episode>38</itunes:episode>
      <podcast:episode>38</podcast:episode>
      <itunes:title>DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d7df4054-4858-4c86-85b3-5b8498c8bd69</guid>
      <link>https://share.transistor.fm/s/460612e7</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02359v1">http://arxiv.org/abs/2411.02359v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of the LLM by 5.2-6.5x and GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02359v1">http://arxiv.org/abs/2411.02359v1</a></p>

            <p><strong>Abstract:</strong><br>
            Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of the LLM by 5.2-6.5x and GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:48:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/460612e7/0428b33f.mp3" length="18393182" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1146</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang</p>

            <p><strong>Title:</strong><br>
            DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02359v1">http://arxiv.org/abs/2411.02359v1</a></p>

            <p><strong>Abstract:</strong><br>
            MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of the LLM by 5.2-6.5x and GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Controlling Language and Diffusion Models by Transporting Activations</title>
      <itunes:episode>37</itunes:episode>
      <podcast:episode>37</podcast:episode>
      <itunes:title>Controlling Language and Diffusion Models by Transporting Activations</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">82d41bfb-8b12-4616-8aba-2369ed93e17d</guid>
      <link>https://share.transistor.fm/s/c189b571</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, cs.CV, 68T07, 49Q22, I.2.6; I.2.7; I.4.8</p>

            <p><strong>Authors:</strong><br>
            Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau</p>

            <p><strong>Title:</strong><br>
            Controlling Language and Diffusion Models by Transporting Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23054v1">http://arxiv.org/abs/2410.23054v1</a></p>

            <p><strong>Abstract:</strong><br>
            The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, cs.CV, 68T07, 49Q22, I.2.6; I.2.7; I.4.8</p>

            <p><strong>Authors:</strong><br>
            Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau</p>

            <p><strong>Title:</strong><br>
            Controlling Language and Diffusion Models by Transporting Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23054v1">http://arxiv.org/abs/2410.23054v1</a></p>

            <p><strong>Abstract:</strong><br>
            The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.</p>
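
            <p><strong>Sketch:</strong><br>
            The abstract describes steering activations with maps guided by optimal transport. Between two one-dimensional Gaussians the optimal-transport map is affine, so a per-coordinate affine map fit on source and target activations is a simple stand-in for the kind of map involved; AcT's actual estimator and hook placement are not reproduced here, and all names below are assumptions.</p>

            <pre><code>import torch

def fit_affine_transport(src_acts, tgt_acts, eps=1e-6):
    """Per-coordinate affine map a -> mu_t + (sd_t / sd_s) * (a - mu_s).

    For 1-D Gaussians this is exactly the optimal-transport map; here it is
    only an illustrative stand-in, fit from two small sets of activations.
    """
    mu_s, sd_s = src_acts.mean(dim=0), src_acts.std(dim=0) + eps
    mu_t, sd_t = tgt_acts.mean(dim=0), tgt_acts.std(dim=0) + eps
    scale = sd_t / sd_s
    shift = mu_t - scale * mu_s
    return scale, shift

def steer(acts, scale, shift, strength=1.0):
    """Blend the original activations with their transported version."""
    return (1.0 - strength) * acts + strength * (acts * scale + shift)

# Toy usage: hidden states collected on "source" vs "target" behaviour prompts
# (the prompt sets and the 4096-d width are assumptions for illustration).
src = torch.randn(512, 4096) * 1.5 + 0.3
tgt = torch.randn(512, 4096) * 1.0 - 0.1
scale, shift = fit_affine_transport(src, tgt)
steered = steer(torch.randn(8, 4096), scale, shift, strength=0.7)
</code></pre>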
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:47:49 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c189b571/37e11563.mp3" length="21919062" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1366</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, cs.CV, 68T07, 49Q22, I.2.6; I.2.7; I.4.8</p>

            <p><strong>Authors:</strong><br>
            Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau</p>

            <p><strong>Title:</strong><br>
            Controlling Language and Diffusion Models by Transporting Activations</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23054v1">http://arxiv.org/abs/2410.23054v1</a></p>

            <p><strong>Abstract:</strong><br>
            The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Sample-Efficient Alignment for LLMs</title>
      <itunes:episode>36</itunes:episode>
      <podcast:episode>36</podcast:episode>
      <itunes:title>Sample-Efficient Alignment for LLMs</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a6bce88e-8153-4c62-92da-345b33c44075</guid>
      <link>https://share.transistor.fm/s/7ee909c5</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Sample-Efficient Alignment for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01493v1">http://arxiv.org/abs/2411.01493v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Sample-Efficient Alignment for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01493v1">http://arxiv.org/abs/2411.01493v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.</p>
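
            <p><strong>Sketch:</strong><br>
            A toy version of the exploration idea: treat an ensemble of reward heads as rough posterior samples, draw two, and let each nominate its favourite response to form the next preference query. The ensemble construction, the pairing rule, and all names below are illustrative assumptions, not the SEA agent.</p>

            <pre><code>import torch

def thompson_duel(candidates, reward_ensemble):
    """Pick a pair of responses to show the preference oracle.

    Drawing two ensemble members as approximate posterior samples and letting
    each act greedily is a double-Thompson-sampling-style heuristic for
    dueling feedback; it illustrates the exploration idea only.
    """
    i, j = torch.randint(len(reward_ensemble), (2,)).tolist()
    first = int(torch.stack([reward_ensemble[i](c) for c in candidates]).argmax())
    second = int(torch.stack([reward_ensemble[j](c) for c in candidates]).argmax())
    return first, second

# Toy usage: 4 candidate responses as feature vectors, 8 linear reward heads
# standing in for posterior draws.
dim = 16
candidates = [torch.randn(dim) for _ in range(4)]
ensemble = []
for _ in range(8):
    w = torch.randn(dim)
    ensemble.append(lambda x, w=w: w @ x)
pair = thompson_duel(candidates, ensemble)   # indices of the two responses to compare
</code></pre>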
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:47:26 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7ee909c5/13c862f1.mp3" length="20274358" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1263</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin</p>

            <p><strong>Title:</strong><br>
            Sample-Efficient Alignment for LLMs</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01493v1">http://arxiv.org/abs/2411.01493v1</a></p>

            <p><strong>Abstract:</strong><br>
            We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DreamPolish: Domain Score Distillation With Progressive Geometry Generation</title>
      <itunes:episode>35</itunes:episode>
      <podcast:episode>35</podcast:episode>
      <itunes:title>DreamPolish: Domain Score Distillation With Progressive Geometry Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">893884ff-7979-4a45-bd87-bbf8740757e8</guid>
      <link>https://share.transistor.fm/s/49f93f54</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yean Cheng, Ziqi Cai, Ming Ding, Wendi Zheng, Shiyu Huang, Yuxiao Dong, Jie Tang, Boxin Shi</p>

            <p><strong>Title:</strong><br>
            DreamPolish: Domain Score Distillation With Progressive Geometry Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01602v1">http://arxiv.org/abs/2411.01602v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying fields of view. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key topic of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yean Cheng, Ziqi Cai, Ming Ding, Wendi Zheng, Shiyu Huang, Yuxiao Dong, Jie Tang, Boxin Shi</p>

            <p><strong>Title:</strong><br>
            DreamPolish: Domain Score Distillation With Progressive Geometry Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01602v1">http://arxiv.org/abs/2411.01602v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying fields of view. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key topic of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.</p>
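
            <p><strong>Sketch:</strong><br>
            For reference, the standard classifier-free guidance combination the abstract builds on. The domain score distillation objective itself is not reproduced here, and the guidance scale is an arbitrary example value.</p>

            <pre><code>import torch

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Standard CFG combination of noise predictions:
    eps_hat = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Dummy noise predictions standing in for a text-to-image diffusion model's output.
eps_u = torch.randn(1, 4, 64, 64)
eps_c = torch.randn(1, 4, 64, 64)
eps_hat = classifier_free_guidance(eps_u, eps_c, guidance_scale=7.5)
</code></pre>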
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:47:03 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/49f93f54/fbb3e34d.mp3" length="17450666" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1087</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 6 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yean Cheng, Ziqi Cai, Ming Ding, Wendi Zheng, Shiyu Huang, Yuxiao Dong, Jie Tang, Boxin Shi</p>

            <p><strong>Title:</strong><br>
            DreamPolish: Domain Score Distillation With Progressive Geometry Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.01602v1">http://arxiv.org/abs/2411.01602v1</a></p>

            <p><strong>Abstract:</strong><br>
            We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying fields of view. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key topic of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Adaptive Length Image Tokenization via Recurrent Allocation</title>
      <itunes:episode>34</itunes:episode>
      <podcast:episode>34</podcast:episode>
      <itunes:title>Adaptive Length Image Tokenization via Recurrent Allocation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">85fb1c67-4897-431f-b97f-da9662bec6fc</guid>
      <link>https://share.transistor.fm/s/999d414d</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman</p>

            <p><strong>Title:</strong><br>
            Adaptive Length Image Tokenization via Recurrent Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02393v1">http://arxiv.org/abs/2411.02393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman</p>

            <p><strong>Title:</strong><br>
            Adaptive Length Image Tokenization via Recurrent Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02393v1">http://arxiv.org/abs/2411.02393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.</p>
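
            <p><strong>Sketch:</strong><br>
            The control flow of adaptive token allocation: add a chunk of latent tokens per iteration and stop once a reconstruction criterion is met, so easy images end up with fewer tokens. The linear encoder/decoder and the error measure below are trivial stand-ins for illustration only, not the paper's architecture.</p>

            <pre><code>import torch
import torch.nn as nn

class ToyAdaptiveTokenizer(nn.Module):
    """Recurrent allocation loop: grow the 1D latent token set chunk by chunk
    and stop early when a toy reconstruction error is small enough."""

    def __init__(self, dim=64, chunk=32, max_tokens=256, tol=0.05):
        super().__init__()
        self.enc = nn.Linear(dim, dim)
        self.dec = nn.Linear(dim, dim)
        self.chunk, self.max_tokens, self.tol = chunk, max_tokens, tol

    @torch.no_grad()
    def forward(self, patches):                               # patches: (n_patches, dim)
        target = patches.mean(dim=0, keepdim=True)
        latents = torch.empty(0, patches.shape[1])
        for step in range(self.max_tokens // self.chunk):
            new = self.enc(patches[step * self.chunk : (step + 1) * self.chunk])
            latents = torch.cat([latents, new], dim=0)        # allocate more capacity
            # toy error: how well the pooled decoded latents match the pooled patches
            err = (self.dec(latents).mean(0, keepdim=True) - target).abs().mean()
            if not err.item() > self.tol:                     # "easy" image: stop early
                break
        return latents, err.item()

tok = ToyAdaptiveTokenizer()
latents, err = tok(torch.randn(256, 64))                      # anywhere from 32 to 256 tokens
</code></pre>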
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:46:40 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/999d414d/b321c59a.mp3" length="20343763" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1268</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 4 | cs.CV, cs.AI, cs.LG, cs.RO</p>

            <p><strong>Authors:</strong><br>
            Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman</p>

            <p><strong>Title:</strong><br>
            Adaptive Length Image Tokenization via Recurrent Allocation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02393v1">http://arxiv.org/abs/2411.02393v1</a></p>

            <p><strong>Abstract:</strong><br>
            Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even large language models - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details</title>
      <itunes:episode>33</itunes:episode>
      <podcast:episode>33</podcast:episode>
      <itunes:title>GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">aa32d0f1-64f7-4ce8-9eab-6a57abec8e99</guid>
      <link>https://share.transistor.fm/s/3845b41f</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han</p>

            <p><strong>Title:</strong><br>
            GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03047v1">http://arxiv.org/abs/2411.03047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han</p>

            <p><strong>Title:</strong><br>
            GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03047v1">http://arxiv.org/abs/2411.03047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/</p>
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:46:17 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/3845b41f/9ae478bb.mp3" length="18421215" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1148</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han</p>

            <p><strong>Title:</strong><br>
            GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03047v1">http://arxiv.org/abs/2411.03047v1</a></p>

            <p><strong>Abstract:</strong><br>
            Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge</title>
      <itunes:episode>32</itunes:episode>
      <podcast:episode>32</podcast:episode>
      <itunes:title>Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">611be412-7695-49eb-8cd6-902ae8230357</guid>
      <link>https://share.transistor.fm/s/eba2568c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Karthik Soman, Andrew Langdon, Catalina Villouta, Chinmay Agrawal, Lashaw Salta, Braian Peetoom, Gianmarco Bellucci, Orion J Buske</p>

            <p><strong>Title:</strong><br>
            Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02657v1">http://arxiv.org/abs/2411.02657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Karthik Soman, Andrew Langdon, Catalina Villouta, Chinmay Agrawal, Lashaw Salta, Braian Peetoom, Gianmarco Bellucci, Orion J Buske</p>

            <p><strong>Title:</strong><br>
            Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02657v1">http://arxiv.org/abs/2411.02657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.</p>
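
            <p><strong>Sketch:</strong><br>
            A generic retrieval-augmented generation step, for orientation only: embed the question, pull the most similar passages, and prepend them to the prompt. Zebra-Llama's actual retriever, index, and prompt format are not described in enough detail here to reproduce, so every name and snippet below is an assumption.</p>

            <pre><code>import torch

def retrieve(query_vec, doc_vecs, k=3):
    """Cosine-similarity top-k retrieval over a small in-memory index."""
    q = query_vec / query_vec.norm()
    d = doc_vecs / doc_vecs.norm(dim=1, keepdim=True)
    scores = d @ q
    return scores.topk(k).indices.tolist()

# Stand-in corpus and embeddings (illustrative placeholders, not real EDS content).
docs = ["EDS subtype overview ...", "Diagnostic criteria ...", "Patient guidance ..."]
doc_vecs = torch.randn(len(docs), 384)
query_vec = torch.randn(384)

context = "\n".join(docs[i] for i in retrieve(query_vec, doc_vecs, k=2))
prompt = f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: ..."
</code></pre>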
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:45:54 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/eba2568c/ca11d95c.mp3" length="24975200" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1557</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 3 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Karthik Soman, Andrew Langdon, Catalina Villouta, Chinmay Agrawal, Lashaw Salta, Braian Peetoom, Gianmarco Bellucci, Orion J Buske</p>

            <p><strong>Title:</strong><br>
            Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02657v1">http://arxiv.org/abs/2411.02657v1</a></p>

            <p><strong>Abstract:</strong><br>
            Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Inference Optimal VLMs Need Only One Visual Token but Larger Models</title>
      <itunes:episode>31</itunes:episode>
      <podcast:episode>31</podcast:episode>
      <itunes:title>Inference Optimal VLMs Need Only One Visual Token but Larger Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">a65c14fe-2946-41de-8205-cc5f8d48e036</guid>
      <link>https://share.transistor.fm/s/83068572</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Inference Optimal VLMs Need Only One Visual Token but Larger Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03312v1">http://arxiv.org/abs/2411.03312v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to substantial compute required to process the large number of input tokens (predominantly from the image) by the LLM. To reduce inference costs, one can either downsize the LLM or reduce the number of input image-tokens, the latter of which has been the focus of many recent works around token compression. However, it is unclear what the optimal trade-off is, as both the factors directly affect the VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Inference Optimal VLMs Need Only One Visual Token but Larger Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03312v1">http://arxiv.org/abs/2411.03312v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to substantial compute required to process the large number of input tokens (predominantly from the image) by the LLM. To reduce inference costs, one can either downsize the LLM or reduce the number of input image-tokens, the latter of which has been the focus of many recent works around token compression. However, it is unclear what the optimal trade-off is, as both the factors directly affect the VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.</p>
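
            <p><strong>Sketch:</strong><br>
            A back-of-the-envelope version of the trade-off: prefill compute grows roughly with parameter count times tokens processed, so shrinking the visual token count frees budget for a larger LLM. The constants below (about 2 FLOPs per parameter per token, 32 text tokens, a 7B / 576-token baseline) are illustrative assumptions, not the paper's fitted scaling laws.</p>

            <pre><code>def prefill_flops(n_params, visual_tokens, text_tokens=32):
    """Rough prefill cost: ~2 FLOPs per parameter per processed token."""
    return 2 * n_params * (visual_tokens + text_tokens)

budget = prefill_flops(7e9, visual_tokens=576)          # baseline: 7B LLM, 576 image tokens
for v in (576, 256, 64, 16, 1):
    # largest LLM that stays within the baseline budget at this visual token count
    max_params = budget / (2 * (v + 32))
    print(f"{v:>3d} visual tokens -> up to {max_params / 1e9:5.1f}B parameters")
</code></pre>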
            ]]>
      </content:encoded>
      <pubDate>Wed, 06 Nov 2024 19:45:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/83068572/15ee2ee7.mp3" length="21312183" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1328</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 2 | cs.CV, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter</p>

            <p><strong>Title:</strong><br>
            Inference Optimal VLMs Need Only One Visual Token but Larger Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.03312v1">http://arxiv.org/abs/2411.03312v1</a></p>

            <p><strong>Abstract:</strong><br>
            Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high latency during inference due to substantial compute required to process the large number of input tokens (predominantly from the image) by the LLM. To reduce inference costs, one can either downsize the LLM or reduce the number of input image-tokens, the latter of which has been the focus of many recent works around token compression. However, it is unclear what the optimal trade-off is, as both the factors directly affect the VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture variations in performance with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved when using the largest LLM that fits within the inference budget while minimizing visual token count - often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents</title>
      <itunes:episode>30</itunes:episode>
      <podcast:episode>30</podcast:episode>
      <itunes:title>AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0fe8d5ab-6b30-4f5e-9c7b-6ab732257545</guid>
      <link>https://share.transistor.fm/s/48643361</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24024v2">http://arxiv.org/abs/2410.24024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been recently a frequently-mentioned interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24024v2">http://arxiv.org/abs/2410.24024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been recently a frequently-mentioned interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:48:55 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/48643361/e29f82f9.mp3" length="21953760" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 40 | cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yifan Xu, Xiao Liu, Xueqiao Sun, Siyi Cheng, Hao Yu, Hanyu Lai, Shudan Zhang, Dan Zhang, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24024v2">http://arxiv.org/abs/2410.24024v2</a></p>

            <p><strong>Abstract:</strong><br>
            Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a frequently discussed interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization</title>
      <itunes:episode>29</itunes:episode>
      <podcast:episode>29</podcast:episode>
      <itunes:title>"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4deeb13b-0186-4dc7-bc3e-9ec2dd57b37d</guid>
      <link>https://share.transistor.fm/s/d6514f3a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02355v1">http://arxiv.org/abs/2411.02355v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02355v1">http://arxiv.org/abs/2411.02355v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:48:34 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d6514f3a/6a06f5bf.mp3" length="23957046" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1494</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 28 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh</p>

            <p><strong>Title:</strong><br>
            "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02355v1">http://arxiv.org/abs/2411.02355v1</a></p>

            <p><strong>Abstract:</strong><br>
            Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning</title>
      <itunes:episode>28</itunes:episode>
      <podcast:episode>28</podcast:episode>
      <itunes:title>WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">891b0903-a168-4bc3-a87a-a5fe8ee8f9b1</guid>
      <link>https://share.transistor.fm/s/894e14ad</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02337v1">http://arxiv.org/abs/2411.02337v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02337v1">http://arxiv.org/abs/2411.02337v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:48:12 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/894e14ad/53787035.mp3" length="21316802" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 25 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, Yuxiao Dong</p>

            <p><strong>Title:</strong><br>
            WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02337v1">http://arxiv.org/abs/2411.02337v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents heavily rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM-4-9B. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D</title>
      <itunes:episode>27</itunes:episode>
      <podcast:episode>27</podcast:episode>
      <itunes:title>MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5c47f3f7-d9b9-4a7b-a686-fb032e77199b</guid>
      <link>https://share.transistor.fm/s/c411ace4</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Cheng, Juncheng Mu, Xianfang Zeng, Xin Chen, Anqi Pang, Chi Zhang, Zhibin Wang, Bin Fu, Gang Yu, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02336v1">http://arxiv.org/abs/2411.02336v1</a></p>

            <p><strong>Abstract:</strong><br>
            Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Cheng, Juncheng Mu, Xianfang Zeng, Xin Chen, Anqi Pang, Chi Zhang, Zhibin Wang, Bin Fu, Gang Yu, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02336v1">http://arxiv.org/abs/2411.02336v1</a></p>

            <p><strong>Abstract:</strong><br>
            Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:47:51 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c411ace4/2bca8685.mp3" length="20505522" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1278</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 20 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Wei Cheng, Juncheng Mu, Xianfang Zeng, Xin Chen, Anqi Pang, Chi Zhang, Zhibin Wang, Bin Fu, Gang Yu, Ziwei Liu, Liang Pan</p>

            <p><strong>Title:</strong><br>
            MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02336v1">http://arxiv.org/abs/2411.02336v1</a></p>

            <p><strong>Abstract:</strong><br>
            Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Training-free Regional Prompting for Diffusion Transformers</title>
      <itunes:episode>26</itunes:episode>
      <podcast:episode>26</podcast:episode>
      <itunes:title>Training-free Regional Prompting for Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">14aa7f90-ea55-4196-920e-9324f258801f</guid>
      <link>https://share.transistor.fm/s/13257aa1</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            Training-free Regional Prompting for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02395v1">http://arxiv.org/abs/2411.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which enables DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            Training-free Regional Prompting for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02395v1">http://arxiv.org/abs/2411.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which enables DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:47:29 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/13257aa1/e28e1a4a.mp3" length="16561651" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1031</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang</p>

            <p><strong>Title:</strong><br>
            Training-free Regional Prompting for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02395v1">http://arxiv.org/abs/2411.02395v1</a></p>

            <p><strong>Abstract:</strong><br>
            Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which enables DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>How Far is Video Generation from World Model: A Physical Law Perspective</title>
      <itunes:episode>25</itunes:episode>
      <podcast:episode>25</podcast:episode>
      <itunes:title>How Far is Video Generation from World Model: A Physical Law Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">ea0d6d96-77af-4d56-867b-c51b88029c39</guid>
      <link>https://share.transistor.fm/s/246f0ef2</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            How Far is Video Generation from World Model: A Physical Law Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02385v1">http://arxiv.org/abs/2411.02385v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color &gt; size &gt; velocity &gt; shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            How Far is Video Generation from World Model: A Physical Law Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02385v1">http://arxiv.org/abs/2411.02385v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color &gt; size &gt; velocity &gt; shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:47:08 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/246f0ef2/9033eda5.mp3" length="22353324" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1393</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng</p>

            <p><strong>Title:</strong><br>
            How Far is Video Generation from World Model: A Physical Law Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02385v1">http://arxiv.org/abs/2411.02385v1</a></p>

            <p><strong>Abstract:</strong><br>
            OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color &gt; size &gt; velocity &gt; shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Survey of Cultural Awareness in Language Models: Text and Beyond</title>
      <itunes:episode>24</itunes:episode>
      <podcast:episode>24</podcast:episode>
      <itunes:title>Survey of Cultural Awareness in Language Models: Text and Beyond</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f6fef434-fe13-4d44-964b-1c085eee14fa</guid>
      <link>https://share.transistor.fm/s/960ffc86</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siddhesh Pawar, Junyeong Park, Jiho Jin, Arnav Arora, Junho Myung, Srishti Yadav, Faiz Ghifari Haznitrama, Inhwa Song, Alice Oh, Isabelle Augenstein</p>

            <p><strong>Title:</strong><br>
            Survey of Cultural Awareness in Language Models: Text and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00860v1">http://arxiv.org/abs/2411.00860v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siddhesh Pawar, Junyeong Park, Jiho Jin, Arnav Arora, Junho Myung, Srishti Yadav, Faiz Ghifari Haznitrama, Inhwa Song, Alice Oh, Isabelle Augenstein</p>

            <p><strong>Title:</strong><br>
            Survey of Cultural Awareness in Language Models: Text and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00860v1">http://arxiv.org/abs/2411.00860v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:46:47 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/960ffc86/10507bb1.mp3" length="22831879" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1423</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 19 | cs.CL, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Siddhesh Pawar, Junyeong Park, Jiho Jin, Arnav Arora, Junho Myung, Srishti Yadav, Faiz Ghifari Haznitrama, Inhwa Song, Alice Oh, Isabelle Augenstein</p>

            <p><strong>Title:</strong><br>
            Survey of Cultural Awareness in Language Models: Text and Beyond</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00860v1">http://arxiv.org/abs/2411.00860v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent</title>
      <itunes:episode>23</itunes:episode>
      <podcast:episode>23</podcast:episode>
      <itunes:title>Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f3b3c340-4f68-43e2-9917-90c2287301a6</guid>
      <link>https://share.transistor.fm/s/ac91dd0a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie She, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02265v2">http://arxiv.org/abs/2411.02265v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.   Codes: https://github.com/Tencent/Hunyuan-Large   Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie She, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02265v2">http://arxiv.org/abs/2411.02265v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.   Codes: https://github.com/Tencent/Hunyuan-Large   Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:46:25 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/ac91dd0a/37bbca34.mp3" length="17534688" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1092</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 16 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie She, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao She, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang</p>

            <p><strong>Title:</strong><br>
            Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02265v2">http://arxiv.org/abs/2411.02265v2</a></p>

            <p><strong>Abstract:</strong><br>
            In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.   Codes: https://github.com/Tencent/Hunyuan-Large   Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>GenXD: Generating Any 3D and 4D Scenes</title>
      <itunes:episode>22</itunes:episode>
      <podcast:episode>22</podcast:episode>
      <itunes:title>GenXD: Generating Any 3D and 4D Scenes</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">b9570b21-dbec-4244-844b-ceb9875a02db</guid>
      <link>https://share.transistor.fm/s/6a4c1b2a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            GenXD: Generating Any 3D and 4D Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02319v2">http://arxiv.org/abs/2411.02319v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            GenXD: Generating Any 3D and 4D Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02319v2">http://arxiv.org/abs/2411.02319v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:46:04 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6a4c1b2a/55e5844b.mp3" length="21328872" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1329</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang</p>

            <p><strong>Title:</strong><br>
            GenXD: Generating Any 3D and 4D Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.02319v2">http://arxiv.org/abs/2411.02319v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models</title>
      <itunes:episode>21</itunes:episode>
      <podcast:episode>21</podcast:episode>
      <itunes:title>DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">79a105cc-2fc1-4e04-8973-f6d119d8e82d</guid>
      <link>https://share.transistor.fm/s/483dd49c</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00836v1">http://arxiv.org/abs/2411.00836v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00836v1">http://arxiv.org/abs/2411.00836v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.</p>
            ]]>
      </content:encoded>
      <pubDate>Tue, 05 Nov 2024 23:45:43 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/483dd49c/0def31a4.mp3" length="18643975" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1162</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang</p>

            <p><strong>Title:</strong><br>
            DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00836v1">http://arxiv.org/abs/2411.00836v1</a></p>

            <p><strong>Abstract:</strong><br>
            The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>OS-ATLAS: A Foundation Action Model for Generalist GUI Agents</title>
      <itunes:episode>20</itunes:episode>
      <podcast:episode>20</podcast:episode>
      <itunes:title>OS-ATLAS: A Foundation Action Model for Generalist GUI Agents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">6a7b573d-2c43-49ae-b4a6-5511f2bd09eb</guid>
      <link>https://share.transistor.fm/s/d726bb73</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao</p>

            <p><strong>Title:</strong><br>
            OS-ATLAS: A Foundation Action Model for Generalist GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23218v1">http://arxiv.org/abs/2410.23218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao</p>

            <p><strong>Title:</strong><br>
            OS-ATLAS: A Foundation Action Model for Generalist GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23218v1">http://arxiv.org/abs/2410.23218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:29:31 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d726bb73/a8c75455.mp3" length="19425091" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1210</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 32 | cs.CL, cs.CV, cs.HC</p>

            <p><strong>Authors:</strong><br>
            Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao</p>

            <p><strong>Title:</strong><br>
            OS-ATLAS: A Foundation Action Model for Generalist GUI Agents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23218v1">http://arxiv.org/abs/2410.23218v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Personalization of Large Language Models: A Survey</title>
      <itunes:episode>19</itunes:episode>
      <podcast:episode>19</podcast:episode>
      <itunes:title>Personalization of Large Language Models: A Survey</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8e824233-c679-4968-9a53-983ad9dc4a56</guid>
      <link>https://share.transistor.fm/s/7e8839db</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Personalization of Large Language Models: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00027v1">http://arxiv.org/abs/2411.00027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Personalization of Large Language Models: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00027v1">http://arxiv.org/abs/2411.00027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:29:10 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/7e8839db/4ff60fbb.mp3" length="24700979" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1540</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang</p>

            <p><strong>Title:</strong><br>
            Personalization of Large Language Models: A Survey</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00027v1">http://arxiv.org/abs/2411.00027v1</a></p>

            <p><strong>Abstract:</strong><br>
            Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Constant Acceleration Flow</title>
      <itunes:episode>18</itunes:episode>
      <podcast:episode>18</podcast:episode>
      <itunes:title>Constant Acceleration Flow</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4b28229d-1681-47a4-80ef-c5b36c165e33</guid>
      <link>https://share.transistor.fm/s/d4b49f40</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dogyun Park, Sojin Lee, Sihyeon Kim, Taehoon Lee, Youngjoon Hong, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            Constant Acceleration Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00322v1">http://arxiv.org/abs/2411.00322v1</a></p>

            <p><strong>Abstract:</strong><br>
Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories with constant velocity. However, we observe that modeling with constant velocity and using reflow procedures have limitations in accurately learning straight trajectories between pairs, resulting in suboptimal performance in few-step generation. To address these limitations, we introduce Constant Acceleration Flow (CAF), a novel framework based on a simple constant acceleration equation. CAF introduces acceleration as an additional learnable variable, allowing for more expressive and accurate estimation of the ODE flow. Moreover, we propose two techniques to further improve estimation accuracy: initial velocity conditioning for the acceleration model and a reflow process for the initial velocity. Our comprehensive studies on toy datasets, CIFAR-10, and ImageNet 64x64 demonstrate that CAF outperforms state-of-the-art baselines for one-step generation. We also show that CAF dramatically improves few-step coupling preservation and inversion over Rectified flow. Code is available at https://github.com/mlvlab/CAF.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dogyun Park, Sojin Lee, Sihyeon Kim, Taehoon Lee, Youngjoon Hong, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            Constant Acceleration Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00322v1">http://arxiv.org/abs/2411.00322v1</a></p>

            <p><strong>Abstract:</strong><br>
Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories with constant velocity. However, we observe that modeling with constant velocity and using reflow procedures have limitations in accurately learning straight trajectories between pairs, resulting in suboptimal performance in few-step generation. To address these limitations, we introduce Constant Acceleration Flow (CAF), a novel framework based on a simple constant acceleration equation. CAF introduces acceleration as an additional learnable variable, allowing for more expressive and accurate estimation of the ODE flow. Moreover, we propose two techniques to further improve estimation accuracy: initial velocity conditioning for the acceleration model and a reflow process for the initial velocity. Our comprehensive studies on toy datasets, CIFAR-10, and ImageNet 64x64 demonstrate that CAF outperforms state-of-the-art baselines for one-step generation. We also show that CAF dramatically improves few-step coupling preservation and inversion over Rectified flow. Code is available at https://github.com/mlvlab/CAF.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:28:48 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/d4b49f40/3966865b.mp3" length="20425650" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1273</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 14 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Dogyun Park, Sojin Lee, Sihyeon Kim, Taehoon Lee, Youngjoon Hong, Hyunwoo J. Kim</p>

            <p><strong>Title:</strong><br>
            Constant Acceleration Flow</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00322v1">http://arxiv.org/abs/2411.00322v1</a></p>

            <p><strong>Abstract:</strong><br>
Rectified flow and reflow procedures have significantly advanced fast generation by progressively straightening ordinary differential equation (ODE) flows. They operate under the assumption that image and noise pairs, known as couplings, can be approximated by straight trajectories with constant velocity. However, we observe that modeling with constant velocity and using reflow procedures have limitations in accurately learning straight trajectories between pairs, resulting in suboptimal performance in few-step generation. To address these limitations, we introduce Constant Acceleration Flow (CAF), a novel framework based on a simple constant acceleration equation. CAF introduces acceleration as an additional learnable variable, allowing for more expressive and accurate estimation of the ODE flow. Moreover, we propose two techniques to further improve estimation accuracy: initial velocity conditioning for the acceleration model and a reflow process for the initial velocity. Our comprehensive studies on toy datasets, CIFAR-10, and ImageNet 64x64 demonstrate that CAF outperforms state-of-the-art baselines for one-step generation. We also show that CAF dramatically improves few-step coupling preservation and inversion over Rectified flow. Code is available at https://github.com/mlvlab/CAF.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models</title>
      <itunes:episode>17</itunes:episode>
      <podcast:episode>17</podcast:episode>
      <itunes:title>TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">64ab066b-6f10-4931-8fd7-ea7beefcdd07</guid>
      <link>https://share.transistor.fm/s/a9a31e14</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23266v1">http://arxiv.org/abs/2410.23266v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape &amp; trend, velocity &amp; frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23266v1">http://arxiv.org/abs/2410.23266v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape &amp; trend, velocity &amp; frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:28:27 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9a31e14/0b328021.mp3" length="23224785" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1448</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 13 | cs.CV, cs.AI, cs.CL</p>

            <p><strong>Authors:</strong><br>
            Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan</p>

            <p><strong>Title:</strong><br>
            TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23266v1">http://arxiv.org/abs/2410.23266v1</a></p>

            <p><strong>Abstract:</strong><br>
            Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape &amp; trend, velocity &amp; frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Randomized Autoregressive Visual Generation</title>
      <itunes:episode>16</itunes:episode>
      <podcast:episode>16</podcast:episode>
      <itunes:title>Randomized Autoregressive Visual Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f832aca7-0688-490c-91d5-d94274369994</guid>
      <link>https://share.transistor.fm/s/96b682c6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            Randomized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00776v1">http://arxiv.org/abs/2411.00776v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence-typically ordered in raster form-is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            Randomized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00776v1">http://arxiv.org/abs/2411.00776v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence-typically ordered in raster form-is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:28:05 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/96b682c6/daccc672.mp3" length="19489439" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1214</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 10 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen</p>

            <p><strong>Title:</strong><br>
            Randomized Autoregressive Visual Generation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00776v1">http://arxiv.org/abs/2411.00776v1</a></p>

            <p><strong>Abstract:</strong><br>
            This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence-typically ordered in raster form-is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Survey of User Interface Design and Interaction Techniques in Generative AI Applications</title>
      <itunes:episode>15</itunes:episode>
      <podcast:episode>15</podcast:episode>
      <itunes:title>Survey of User Interface Design and Interaction Techniques in Generative AI Applications</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">92e936da-1b3f-4b51-a856-f8acff855a6c</guid>
      <link>https://share.transistor.fm/s/94c38414</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Reuben Luera, Ryan A. Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao, Samyadeep Basu, Puneet Mathur, Nedim Lipka</p>

            <p><strong>Title:</strong><br>
            Survey of User Interface Design and Interaction Techniques in Generative AI Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22370v1">http://arxiv.org/abs/2410.22370v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applications of generative AI have become extremely impressive, and the interplay between users and AI is even more so. Current human-AI interaction literature has taken a broad look at how humans interact with generative AI, but it lacks specificity regarding the user interface designs and patterns used to create these applications. Therefore, we present a survey that comprehensively presents taxonomies of how a human interacts with AI and the user interaction patterns designed to meet the needs of a variety of relevant use cases. We focus primarily on user-guided interactions, surveying interactions that are initiated by the user and do not include any implicit signals given by the user. With this survey, we aim to create a compendium of different user-interaction patterns that can be used as a reference for designers and developers alike. In doing so, we also strive to lower the entry barrier for those attempting to learn more about the design of generative AI applications.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Reuben Luera, Ryan A. Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao, Samyadeep Basu, Puneet Mathur, Nedim Lipka</p>

            <p><strong>Title:</strong><br>
            Survey of User Interface Design and Interaction Techniques in Generative AI Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22370v1">http://arxiv.org/abs/2410.22370v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applications of generative AI have become extremely impressive, and the interplay between users and AI is even more so. Current human-AI interaction literature has taken a broad look at how humans interact with generative AI, but it lacks specificity regarding the user interface designs and patterns used to create these applications. Therefore, we present a survey that comprehensively presents taxonomies of how a human interacts with AI and the user interaction patterns designed to meet the needs of a variety of relevant use cases. We focus primarily on user-guided interactions, surveying interactions that are initiated by the user and do not include any implicit signals given by the user. With this survey, we aim to create a compendium of different user-interaction patterns that can be used as a reference for designers and developers alike. In doing so, we also strive to lower the entry barrier for those attempting to learn more about the design of generative AI applications.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:27:44 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/94c38414/77fb785d.mp3" length="22791361" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1421</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.HC, cs.AI, cs.CL, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Reuben Luera, Ryan A. Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao, Samyadeep Basu, Puneet Mathur, Nedim Lipka</p>

            <p><strong>Title:</strong><br>
            Survey of User Interface Design and Interaction Techniques in Generative AI Applications</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22370v1">http://arxiv.org/abs/2410.22370v1</a></p>

            <p><strong>Abstract:</strong><br>
            The applications of generative AI have become extremely impressive, and the interplay between users and AI is even more so. Current human-AI interaction literature has taken a broad look at how humans interact with generative AI, but it lacks specificity regarding the user interface designs and patterns used to create these applications. Therefore, we present a survey that comprehensively presents taxonomies of how a human interacts with AI and the user interaction patterns designed to meet the needs of a variety of relevant use cases. We focus primarily on user-guided interactions, surveying interactions that are initiated by the user and do not include any implicit signals given by the user. With this survey, we aim to create a compendium of different user-interaction patterns that can be used as a reference for designers and developers alike. In doing so, we also strive to lower the entry barrier for those attempting to learn more about the design of generative AI applications.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation</title>
      <itunes:episode>14</itunes:episode>
      <podcast:episode>14</podcast:episode>
      <itunes:title>Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">1661d300-c48e-4900-95dd-df44bff6e8b2</guid>
      <link>https://share.transistor.fm/s/feb3ecb6</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, I.2.6; I.2.7</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu</p>

            <p><strong>Title:</strong><br>
            Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00412v1">http://arxiv.org/abs/2411.00412v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool's information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, I.2.6; I.2.7</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu</p>

            <p><strong>Title:</strong><br>
            Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00412v1">http://arxiv.org/abs/2411.00412v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool's information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:27:22 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/feb3ecb6/9eff6063.mp3" length="20124796" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1254</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 8 | cs.LG, cs.AI, cs.CL, I.2.6; I.2.7</p>

            <p><strong>Authors:</strong><br>
            Bohan Lyu, Yadi Cao, Duncan Watson-Parris, Leon Bergen, Taylor Berg-Kirkpatrick, Rose Yu</p>

            <p><strong>Title:</strong><br>
            Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00412v1">http://arxiv.org/abs/2411.00412v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool's information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>In-Context LoRA for Diffusion Transformers</title>
      <itunes:episode>13</itunes:episode>
      <podcast:episode>13</podcast:episode>
      <itunes:title>In-Context LoRA for Diffusion Transformers</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">72c20503-190b-49dc-aa7a-b844da834acf</guid>
      <link>https://share.transistor.fm/s/a9d42d9a</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            In-Context LoRA for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23775v2">http://arxiv.org/abs/2410.23775v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            In-Context LoRA for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23775v2">http://arxiv.org/abs/2410.23775v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:27:01 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/a9d42d9a/6ae77779.mp3" length="19587240" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1221</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.CV, cs.GR</p>

            <p><strong>Authors:</strong><br>
            Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou</p>

            <p><strong>Title:</strong><br>
            In-Context LoRA for Diffusion Transformers</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23775v2">http://arxiv.org/abs/2410.23775v2</a></p>

            <p><strong>Abstract:</strong><br>
            Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Physics in Next-token Prediction</title>
      <itunes:episode>12</itunes:episode>
      <podcast:episode>12</podcast:episode>
      <itunes:title>Physics in Next-token Prediction</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4740ced0-5b8b-4bef-ac28-0cc6a60ad840</guid>
      <link>https://share.transistor.fm/s/b2f2bdce</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjun An, Yiliang Song, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Physics in Next-token Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00660v1">http://arxiv.org/abs/2411.00660v1</a></p>

            <p><strong>Abstract:</strong><br>
            We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjun An, Yiliang Song, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Physics in Next-token Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00660v1">http://arxiv.org/abs/2411.00660v1</a></p>

            <p><strong>Abstract:</strong><br>
            We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:26:39 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/b2f2bdce/f8f9c335.mp3" length="18045797" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1124</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 7 | cs.LG, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Hongjun An, Yiliang Song, Xuelong Li</p>

            <p><strong>Title:</strong><br>
            Physics in Next-token Prediction</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00660v1">http://arxiv.org/abs/2411.00660v1</a></p>

            <p><strong>Abstract:</strong><br>
            We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes</title>
      <itunes:episode>11</itunes:episode>
      <podcast:episode>11</podcast:episode>
      <itunes:title>CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">e88db41f-178b-4d87-bebb-b21960df55ad</guid>
      <link>https://share.transistor.fm/s/9bbd64a8</link>
      <description>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00771v1">http://arxiv.org/abs/2411.00771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, manifesting efficient and high-fidelity novel view synthesis. However, accurately representing surfaces, especially in large and complex scenarios, remains a significant challenge due to the unstructured nature of 3DGS. In this paper, we present CityGaussianV2, a novel approach for large-scale scene reconstruction that addresses critical challenges related to geometric accuracy and efficiency. Building on the favorable generalization capabilities of 2D Gaussian Splatting (2DGS), we address its convergence and scalability issues. Specifically, we implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence. To scale up, we introduce an elongation filter that mitigates Gaussian count explosion caused by 2DGS degeneration. Furthermore, we optimize the CityGaussian pipeline for parallel training, achieving up to 10$\times$ compression, at least 25% savings in training time, and a 50% decrease in memory usage. We also established standard geometry benchmarks under large-scale scenes. Experimental results demonstrate that our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs. The project page is available at https://dekuliutesla.github.io/CityGaussianV2/.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00771v1">http://arxiv.org/abs/2411.00771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, manifesting efficient and high-fidelity novel view synthesis. However, accurately representing surfaces, especially in large and complex scenarios, remains a significant challenge due to the unstructured nature of 3DGS. In this paper, we present CityGaussianV2, a novel approach for large-scale scene reconstruction that addresses critical challenges related to geometric accuracy and efficiency. Building on the favorable generalization capabilities of 2D Gaussian Splatting (2DGS), we address its convergence and scalability issues. Specifically, we implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence. To scale up, we introduce an elongation filter that mitigates Gaussian count explosion caused by 2DGS degeneration. Furthermore, we optimize the CityGaussian pipeline for parallel training, achieving up to 10$\times$ compression, at least 25% savings in training time, and a 50% decrease in memory usage. We also established standard geometry benchmarks under large-scale scenes. Experimental results demonstrate that our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs. The project page is available at https://dekuliutesla.github.io/CityGaussianV2/.</p>
            ]]>
      </content:encoded>
      <pubDate>Mon, 04 Nov 2024 19:26:18 -0800</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9bbd64a8/32ed359c.mp3" length="19640369" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1224</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Paper Upvotes: 5 | cs.CV</p>

            <p><strong>Authors:</strong><br>
            Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, Zhaoxiang Zhang</p>

            <p><strong>Title:</strong><br>
            CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2411.00771v1">http://arxiv.org/abs/2411.00771v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recently, 3D Gaussian Splatting (3DGS) has revolutionized radiance field reconstruction, manifesting efficient and high-fidelity novel view synthesis. However, accurately representing surfaces, especially in large and complex scenarios, remains a significant challenge due to the unstructured nature of 3DGS. In this paper, we present CityGaussianV2, a novel approach for large-scale scene reconstruction that addresses critical challenges related to geometric accuracy and efficiency. Building on the favorable generalization capabilities of 2D Gaussian Splatting (2DGS), we address its convergence and scalability issues. Specifically, we implement a decomposed-gradient-based densification and depth regression technique to eliminate blurry artifacts and accelerate convergence. To scale up, we introduce an elongation filter that mitigates Gaussian count explosion caused by 2DGS degeneration. Furthermore, we optimize the CityGaussian pipeline for parallel training, achieving up to 10$\times$ compression, at least 25% savings in training time, and a 50% decrease in memory usage. We also established standard geometry benchmarks under large-scale scenes. Experimental results demonstrate that our method strikes a promising balance between visual quality, geometric accuracy, as well as storage and training costs. The project page is available at https://dekuliutesla.github.io/CityGaussianV2/.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders</title>
      <itunes:episode>10</itunes:episode>
      <podcast:episode>10</podcast:episode>
      <itunes:title>Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3f01e04a-c308-4ae9-8b14-2b3973d9bee4</guid>
      <link>https://share.transistor.fm/s/fee1bedf</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 57 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre</p>

            <p><strong>Title:</strong><br>
            Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22366v1">http://arxiv.org/abs/2410.22366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large-language models (LLMs). For LLMs, they have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigated the possibility of using SAEs to learn interpretable features for a few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 57 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre</p>

            <p><strong>Title:</strong><br>
            Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22366v1">http://arxiv.org/abs/2410.22366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large-language models (LLMs). For LLMs, they have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigated the possibility of using SAEs to learn interpretable features for a few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:29:08 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/fee1bedf/20825e84.mp3" length="22257619" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1387</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 57 | cs.LG, cs.AI, cs.CV</p>

            <p><strong>Authors:</strong><br>
            Viacheslav Surkov, Chris Wendler, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre</p>

            <p><strong>Title:</strong><br>
            Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22366v1">http://arxiv.org/abs/2410.22366v1</a></p>

            <p><strong>Abstract:</strong><br>
            Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large-language models (LLMs). For LLMs, they have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigated the possibility of using SAEs to learn interpretable features for a few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain. Code is available at https://github.com/surkovv/sdxl-unbox</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective</title>
      <itunes:episode>9</itunes:episode>
      <podcast:episode>9</podcast:episode>
      <itunes:title>What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">d58f0cfc-3c1e-4b40-be33-5aa12aff2a0c</guid>
      <link>https://share.transistor.fm/s/f106134a</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23743v1">http://arxiv.org/abs/2410.23743v1</a></p>

            <p><strong>Abstract:</strong><br>
            What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: https://github.com/MingLiiii/Layer_Gradient.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23743v1">http://arxiv.org/abs/2410.23743v1</a></p>

            <p><strong>Abstract:</strong><br>
            What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: https://github.com/MingLiiii/Layer_Gradient.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:28:54 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/f106134a/1ead798a.mp3" length="20004413" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1247</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 45 | cs.CL, cs.AI, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Ming Li, Yanhong Li, Tianyi Zhou</p>

            <p><strong>Title:</strong><br>
            What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23743v1">http://arxiv.org/abs/2410.23743v1</a></p>

            <p><strong>Abstract:</strong><br>
            What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs), through the lens of gradient, when training with different responses and initial models. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Moreover, pre-trained LLMs are less affected by the instability of fast thinking than instruction-tuned LLMs. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent. Our code, data, and gradient statistics can be found in: https://github.com/MingLiiii/Layer_Gradient.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents</title>
      <itunes:episode>8</itunes:episode>
      <podcast:episode>8</podcast:episode>
      <itunes:title>A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">0de89a3e-928a-4a25-9212-e67db8c2fcbb</guid>
      <link>https://share.transistor.fm/s/6b91168e</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 20 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Ankan Mullick, Sombit Bose, Abhilash Nandy, Gajula Sai Chaitanya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22476v1">http://arxiv.org/abs/2410.22476v1</a></p>

            <p><strong>Abstract:</strong><br>
            In task-oriented dialogue systems, intent detection is crucial for interpreting user queries and providing appropriate responses. Existing research primarily addresses simple queries with a single intent, lacking effective systems for handling complex queries with multiple intents and extracting different intent spans. Additionally, there is a notable absence of multilingual, multi-intent datasets. This study addresses three critical tasks: extracting multiple intent spans from queries, detecting multiple intents, and developing a multi-lingual multi-label intent dataset. We introduce a novel multi-label multi-class intent detection dataset (MLMCID-dataset) curated from existing benchmark datasets. We also propose a pointer network-based architecture (MLMCID) to extract intent spans and detect multiple intents with coarse and fine-grained labels in the form of sextuplets. Comprehensive analysis demonstrates the superiority of our pointer network-based system over baseline approaches in terms of accuracy and F1-score across various datasets.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 20 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Ankan Mullick, Sombit Bose, Abhilash Nandy, Gajula Sai Chaitanya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22476v1">http://arxiv.org/abs/2410.22476v1</a></p>

            <p><strong>Abstract:</strong><br>
            In task-oriented dialogue systems, intent detection is crucial for interpreting user queries and providing appropriate responses. Existing research primarily addresses simple queries with a single intent, lacking effective systems for handling complex queries with multiple intents and extracting different intent spans. Additionally, there is a notable absence of multilingual, multi-intent datasets. This study addresses three critical tasks: extracting multiple intent spans from queries, detecting multiple intents, and developing a multi-lingual multi-label intent dataset. We introduce a novel multi-label multi-class intent detection dataset (MLMCID-dataset) curated from existing benchmark datasets. We also propose a pointer network-based architecture (MLMCID) to extract intent spans and detect multiple intents with coarse and fine-grained labels in the form of sextuplets. Comprehensive analysis demonstrates the superiority of our pointer network-based system over baseline approaches in terms of accuracy and F1-score across various datasets.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:28:38 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/6b91168e/4e7ac735.mp3" length="21502806" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1340</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 20 | cs.CL, cs.IR</p>

            <p><strong>Authors:</strong><br>
            Ankan Mullick, Sombit Bose, Abhilash Nandy, Gajula Sai Chaitanya, Pawan Goyal</p>

            <p><strong>Title:</strong><br>
            A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.22476v1">http://arxiv.org/abs/2410.22476v1</a></p>

            <p><strong>Abstract:</strong><br>
            In task-oriented dialogue systems, intent detection is crucial for interpreting user queries and providing appropriate responses. Existing research primarily addresses simple queries with a single intent, lacking effective systems for handling complex queries with multiple intents and extracting different intent spans. Additionally, there is a notable absence of multilingual, multi-intent datasets. This study addresses three critical tasks: extracting multiple intent spans from queries, detecting multiple intents, and developing a multi-lingual multi-label intent dataset. We introduce a novel multi-label multi-class intent detection dataset (MLMCID-dataset) curated from existing benchmark datasets. We also propose a pointer network-based architecture (MLMCID) to extract intent spans and detect multiple intents with coarse and fine-grained labels in the form of sextuplets. Comprehensive analysis demonstrates the superiority of our pointer network-based system over baseline approaches in terms of accuracy and F1-score across various datasets.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Language Models can Self-Lengthen to Generate Long Texts</title>
      <itunes:episode>7</itunes:episode>
      <podcast:episode>7</podcast:episode>
      <itunes:title>Language Models can Self-Lengthen to Generate Long Texts</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">98464b07-b5ba-40dc-bc8e-5b83213826a1</guid>
      <link>https://share.transistor.fm/s/9016b30e</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Language Models can Self-Lengthen to Generate Long Texts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23933v1">http://arxiv.org/abs/2410.23933v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation, when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at https://github.com/QwenLM/Self-Lengthen.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Language Models can Self-Lengthen to Generate Long Texts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23933v1">http://arxiv.org/abs/2410.23933v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation, when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at https://github.com/QwenLM/Self-Lengthen.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:28:11 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/9016b30e/54fc8700.mp3" length="19649529" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1224</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 14 | cs.CL</p>

            <p><strong>Authors:</strong><br>
            Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, Junyang Lin</p>

            <p><strong>Title:</strong><br>
            Language Models can Self-Lengthen to Generate Long Texts</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23933v1">http://arxiv.org/abs/2410.23933v1</a></p>

            <p><strong>Abstract:</strong><br>
            Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to process long contexts, yet a notable gap remains in generating long, aligned outputs. This limitation stems from a training gap where pre-training lacks effective instructions for long-text generation, and post-training data primarily consists of short query-response pairs. Current approaches, such as instruction backtranslation and behavior imitation, face challenges including data quality, copyright issues, and constraints on proprietary model usage. In this paper, we introduce an innovative iterative training framework called Self-Lengthen that leverages only the intrinsic knowledge and skills of LLMs without the need for auxiliary data or proprietary models. The framework consists of two roles: the Generator and the Extender. The Generator produces the initial response, which is then split and expanded by the Extender. This process results in a new, longer response, which is used to train both the Generator and the Extender iteratively. Through this process, the models are progressively trained to handle increasingly longer responses. Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation, when applied to top open-source LLMs such as Qwen2 and LLaMA3. Our code is publicly available at https://github.com/QwenLM/Self-Lengthen.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Constraint Back-translation Improves Complex Instruction Following of Large Language Models</title>
      <itunes:episode>6</itunes:episode>
      <podcast:episode>6</podcast:episode>
      <itunes:title>Constraint Back-translation Improves Complex Instruction Following of Large Language Models</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">8bf3cb2b-3efb-4a71-b104-8e2b135a1167</guid>
      <link>https://share.transistor.fm/s/45cfe92a</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Constraint Back-translation Improves Complex Instruction Following of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24175v1">http://arxiv.org/abs/2410.24175v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. However, even advanced LLMs cannot follow complex instructions well, thus limiting the quality of generated data. In this work, we find that existing datasets inherently contain implicit complex constraints and propose a novel data generation technique, constraint back-translation. Specifically, we take the high-quality instruction-response pairs in existing datasets and only adopt advanced LLMs to add complex constraints already met by the responses to the instructions, which naturally reduces costs and data noise. In the experiments, we adopt Llama3-70B-Instruct to back-translate constraints and create a high-quality complex instruction-response dataset, named CRAB. We present that post-training on CRAB improves multiple backbone LLMs' complex instruction-following ability, evaluated on extensive instruction-following benchmarks. We further find that constraint back-translation also serves as a useful auxiliary training objective in post-training. Our code, data, and models will be released to facilitate future research.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Constraint Back-translation Improves Complex Instruction Following of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24175v1">http://arxiv.org/abs/2410.24175v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. However, even advanced LLMs cannot follow complex instructions well, thus limiting the quality of generated data. In this work, we find that existing datasets inherently contain implicit complex constraints and propose a novel data generation technique, constraint back-translation. Specifically, we take the high-quality instruction-response pairs in existing datasets and only adopt advanced LLMs to add complex constraints already met by the responses to the instructions, which naturally reduces costs and data noise. In the experiments, we adopt Llama3-70B-Instruct to back-translate constraints and create a high-quality complex instruction-response dataset, named CRAB. We present that post-training on CRAB improves multiple backbone LLMs' complex instruction-following ability, evaluated on extensive instruction-following benchmarks. We further find that constraint back-translation also serves as a useful auxiliary training objective in post-training. Our code, data, and models will be released to facilitate future research.</p>
            ]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:27:57 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/45cfe92a/5e9d6112.mp3" length="18923569" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1179</itunes:duration>
      <itunes:summary>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 12 | cs.CL, cs.AI</p>

            <p><strong>Authors:</strong><br>
            Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li</p>

            <p><strong>Title:</strong><br>
            Constraint Back-translation Improves Complex Instruction Following of Large Language Models</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.24175v1">http://arxiv.org/abs/2410.24175v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. However, even advanced LLMs cannot follow complex instructions well, thus limiting the quality of generated data. In this work, we find that existing datasets inherently contain implicit complex constraints and propose a novel data generation technique, constraint back-translation. Specifically, we take the high-quality instruction-response pairs in existing datasets and only adopt advanced LLMs to add complex constraints already met by the responses to the instructions, which naturally reduces costs and data noise. In the experiments, we adopt Llama3-70B-Instruct to back-translate constraints and create a high-quality complex instruction-response dataset, named CRAB. We present that post-training on CRAB improves multiple backbone LLMs' complex instruction-following ability, evaluated on extensive instruction-following benchmarks. We further find that constraint back-translation also serves as a useful auxiliary training objective in post-training. Our code, data, and models will be released to facilitate future research.</p>
            ]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments</title>
      <itunes:episode>5</itunes:episode>
      <podcast:episode>5</podcast:episode>
      <itunes:title>BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">deec23e8-b688-446e-8822-99bd834242fc</guid>
      <link>https://share.transistor.fm/s/c651ef94</link>
      <description>
        <![CDATA[
            <p>🤗 Daily Paper Upvotes: 11 | cs.CL, cs.AI, cs.CV, cs.LG</p>

            <p><strong>Authors:</strong><br>
            Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu</p>

            <p><strong>Title:</strong><br>
            BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments</p>

            <p><strong>Arxiv:</strong><br>
            <a href="http://arxiv.org/abs/2410.23918v1">http://arxiv.org/abs/2410.23918v1</a></p>

            <p><strong>Abstract:</strong><br>
            Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textit{capability} to \textit{availability}, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbf{BitStack}, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.</p>
            ]]>
      </description>
      <content:encoded>
        <![CDATA[🤗 Daily Paper Upvotes: 11

Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

Categories: cs.CL, cs.AI, cs.CV, cs.LG

Arxiv: http://arxiv.org/abs/2410.23918v1

Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:27:43 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/c651ef94/df087182.mp3" length="17291451" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1077</itunes:duration>
      <itunes:summary>
        <![CDATA[🤗 Daily Paper Upvotes: 11

Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu

Categories: cs.CL, cs.AI, cs.CV, cs.LG

Arxiv: http://arxiv.org/abs/2410.23918v1

Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments

Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at https://github.com/xinghaow99/BitStack.]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>SelfCodeAlign: Self-Alignment for Code Generation</title>
      <itunes:episode>4</itunes:episode>
      <podcast:episode>4</podcast:episode>
      <itunes:title>SelfCodeAlign: Self-Alignment for Code Generation</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">3f727d94-5616-4565-8b8a-66c52292d73c</guid>
      <link>https://share.transistor.fm/s/4bc5d654</link>
      <description>
        <![CDATA[🤗 Daily Paper Upvotes: 11

Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

Categories: cs.CL, cs.LG, cs.SE

Arxiv: http://arxiv.org/abs/2410.24198v1

Title: SelfCodeAlign: Self-Alignment for Code Generation

Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.]]>
      </description>
      <content:encoded>
        <![CDATA[🤗 Daily Paper Upvotes: 11

Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

Categories: cs.CL, cs.LG, cs.SE

Arxiv: http://arxiv.org/abs/2410.24198v1

Title: SelfCodeAlign: Self-Alignment for Code Generation

Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:27:28 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/4bc5d654/3a3a5809.mp3" length="18149885" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1131</itunes:duration>
      <itunes:summary>
        <![CDATA[🤗 Daily Paper Upvotes: 11

Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

Categories: cs.CL, cs.LG, cs.SE

Arxiv: http://arxiv.org/abs/2410.24198v1

Title: SelfCodeAlign: Self-Alignment for Code Generation

Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>Learning Video Representations without Natural Videos</title>
      <itunes:episode>3</itunes:episode>
      <podcast:episode>3</podcast:episode>
      <itunes:title>Learning Video Representations without Natural Videos</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">4f929ada-bc55-40d3-97af-a0abc6bfd388</guid>
      <link>https://share.transistor.fm/s/28b0d0b3</link>
      <description>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.24213v1

Title: Learning Video Representations without Natural Videos

Abstract: In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.]]>
      </description>
      <content:encoded>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.24213v1

Title: Learning Video Representations without Natural Videos

Abstract: In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.]]>
      </content:encoded>
      <pubDate>Sun, 03 Nov 2024 00:26:52 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/28b0d0b3/51edf2da.mp3" length="21942032" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1368</itunes:duration>
      <itunes:summary>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.24213v1

Title: Learning Video Representations without Natural Videos

Abstract: In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>AAAR-1.0: Assessing AI's Potential to Assist Research</title>
      <itunes:episode>2</itunes:episode>
      <podcast:episode>2</podcast:episode>
      <itunes:title>AAAR-1.0: Assessing AI's Potential to Assist Research</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">f0dda5f6-aea7-49bd-938c-36178f270893</guid>
      <link>https://share.transistor.fm/s/73b87475</link>
      <description>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin

Categories: cs.CL

Arxiv: http://arxiv.org/abs/2410.22394v1

Title: AAAR-1.0: Assessing AI's Potential to Assist Research

Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.]]>
      </description>
      <content:encoded>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin

Categories: cs.CL

Arxiv: http://arxiv.org/abs/2410.22394v1

Title: AAAR-1.0: Assessing AI's Potential to Assist Research

Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.]]>
      </content:encoded>
      <pubDate>Sat, 02 Nov 2024 23:48:20 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/73b87475/b5481123.mp3" length="21552076" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1343</itunes:duration>
      <itunes:summary>
        <![CDATA[🤗 Daily Paper Upvotes: 10

Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin

Categories: cs.CL

Arxiv: http://arxiv.org/abs/2410.22394v1

Title: AAAR-1.0: Assessing AI's Potential to Assist Research

Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
    <item>
      <title>BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays</title>
      <itunes:episode>1</itunes:episode>
      <podcast:episode>1</podcast:episode>
      <itunes:title>BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays</itunes:title>
      <itunes:episodeType>full</itunes:episodeType>
      <guid isPermaLink="false">5f2b932d-902f-468e-9304-24832f9babc2</guid>
      <link>https://share.transistor.fm/s/faf58fe1</link>
      <description>
        <![CDATA[🤗 Daily Paper Upvotes: 7

Authors: Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, Rick Siow Mong Goh

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.21969v1

Title: BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Abstract: Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This poses great challenges in evaluating how well a MedVLP method generalizes to various clinically-relevant tasks due to the lack of a unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematic analysis of MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) Comprehensive datasets covering nine datasets and four medical tasks; 2) Benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) Unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation, respectively. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and find that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code is available at https://github.com/yangzhou12/BenchX.]]>
      </description>
      <content:encoded>
        <![CDATA[🤗 Daily Paper Upvotes: 7

Authors: Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, Rick Siow Mong Goh

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.21969v1

Title: BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Abstract: Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This poses great challenges in evaluating how well a MedVLP method generalizes to various clinically-relevant tasks due to the lack of a unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematic analysis of MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) Comprehensive datasets covering nine datasets and four medical tasks; 2) Benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) Unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation, respectively. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and find that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code is available at https://github.com/yangzhou12/BenchX.]]>
      </content:encoded>
      <pubDate>Sat, 02 Nov 2024 23:48:19 -0700</pubDate>
      <author>Jingwen Liang, Gengyu Wang</author>
      <enclosure url="https://media.transistor.fm/faf58fe1/79ad6213.mp3" length="20883382" type="audio/mpeg"/>
      <itunes:author>Jingwen Liang, Gengyu Wang</itunes:author>
      <itunes:duration>1302</itunes:duration>
      <itunes:summary>
        <![CDATA[🤗 Daily Paper Upvotes: 7

Authors: Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, Rick Siow Mong Goh

Categories: cs.CV

Arxiv: http://arxiv.org/abs/2410.21969v1

Title: BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Abstract: Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This poses great challenges in evaluating how well a MedVLP method generalizes to various clinically-relevant tasks due to the lack of a unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematic analysis of MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) Comprehensive datasets covering nine datasets and four medical tasks; 2) Benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) Unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation, respectively. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and find that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code is available at https://github.com/yangzhou12/BenchX.]]>
      </itunes:summary>
      <itunes:keywords></itunes:keywords>
      <itunes:explicit>No</itunes:explicit>
    </item>
  </channel>
</rss>
